Reputation: 2017
I have set up a Spark 1.3.1 application that collects event data. One of the attributes is a timestamp called 'occurredAt'. I'm intending to partition the event data into parquet files on a filestore, and according to the documentation (https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#partition-discovery) time-based values are not supported as partition columns, only string and int. So I've split the date into Year, Month and Day values and partitioned as follows:
events
|---occurredAtYear=2015
| |---occurredAtMonth=07
| | |---occurredAtDay=16
| | | |---<parquet-files>
...
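For what it's worth, Spark 1.3.1 has nothing like partitionBy on the write side (the DataFrameWriter with partitionBy only arrived in 1.4), so a layout like the above has to be produced by hand. A minimal sketch of one way to do that, where events_df, its column names, and the (year, month, day) list are my own hypothetical stand-ins:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName='events-writer')
sqlContext = SQLContext(sc)

def partition_path(y, m, d):
    # key=value directory names are what partition discovery looks for
    return '/var/tmp/events/occurredAtYear=%d/occurredAtMonth=%02d/occurredAtDay=%02d' % (y, m, d)

# events_df and its columns are hypothetical; one save per day present in the data
for (y, m, d) in [(2015, 7, 16)]:
    day_df = (events_df
              .filter(events_df.occurredAtYear == y)
              .filter(events_df.occurredAtMonth == m)
              .filter(events_df.occurredAtDay == d)
              .select('eventId', 'occurredAt'))  # keep partition values in the path, not in the files
    day_df.saveAsParquetFile(partition_path(y, m, d))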
I then load the parquet data from the root path /var/tmp/events:
sqlContext.parquetFile('/var/tmp/events')
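For the SQL below to run, the loaded DataFrame also has to be registered as a table; a minimal sketch, using 'events' as the table name:

df = sqlContext.parquetFile('/var/tmp/events')
df.registerTempTable('events')
df.printSchema()  # the discovered partition columns should show up here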
The documentation says:
'Spark SQL will automatically extract the partitioning information from the paths'
However, my query
SELECT * FROM events WHERE occurredAtYear = 2015
fails miserably, with Spark saying it cannot resolve 'occurredAtYear'.
I can see the schema for all the other attributes of the event and can run queries on those, but printSchema does not list occurredAtYear/Month/Day in the schema at all. What am I missing to get partitioning working properly?
Cheers
Upvotes: 1
Views: 4287
Reputation: 2017
So it turns out I was following the instructions too precisely; I was actually writing the parquet files out to
/var/tmp/occurredAtYear=2015/occurredAtMonth=07/occurredAtDay=16/data.parquet
The 'data.parquet' suffix was creating a further directory level with the parquet files underneath it. I should have been saving the parquet files to
/var/tmp/occurredAtYear=2015/occurredAtMonth=07/occurredAtDay=16
All works now with the schema being discovered correctly.
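In other words, the fix was to treat the day directory itself as the save target (a sketch, reusing the hypothetical day_df from above):

# Wrong: 'data.parquet' becomes an extra directory level that does not
# match the key=value pattern, so partition discovery breaks on it
day_df.saveAsParquetFile(
    '/var/tmp/occurredAtYear=2015/occurredAtMonth=07/occurredAtDay=16/data.parquet')

# Right: save into the day directory itself, so the parquet part files
# land directly underneath it
day_df.saveAsParquetFile(
    '/var/tmp/occurredAtYear=2015/occurredAtMonth=07/occurredAtDay=16')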
Upvotes: 8