Reputation: 2017
I have set up a Spark 1.3.1 application that collects event data. One of the attributes is a timestamp called 'occurredAt'. I'm intending to partition the event data into parquet files on a filestore, and according to the documentation (https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#partition-discovery) time-based values are not supported as partition columns, only string and int. So I've split the date into Year, Month and Day values and partitioned as follows:
events
|---occurredAtYear=2015
| |---occurredAtMonth=07
| | |---occurredAtDay=16
| | | |---<parquet-files>
...
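For what it's worth, Spark 1.3.1 has nothing like partitionBy on the write side (the DataFrameWriter with partitionBy only arrived in 1.4), so a layout like the above has to be produced by hand. A minimal sketch of one way to do that, where events_df, its column names, and the (year, month, day) list are my own hypothetical stand-ins:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName='events-writer')
sqlContext = SQLContext(sc)

def partition_path(y, m, d):
    # key=value directory names are what partition discovery looks for
    return '/var/tmp/events/occurredAtYear=%d/occurredAtMonth=%02d/occurredAtDay=%02d' % (y, m, d)

# events_df and its columns are hypothetical; one save per day present in the data
for (y, m, d) in [(2015, 7, 16)]:
    day_df = (events_df
              .filter(events_df.occurredAtYear == y)
              .filter(events_df.occurredAtMonth == m)
              .filter(events_df.occurredAtDay == d)
              .select('eventId', 'occurredAt'))  # keep partition values in the path, not in the files
    day_df.saveAsParquetFile(partition_path(y, m, d))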
I then load the parquet data from the root path /var/tmp/events:
sqlContext.parquetFile('/var/tmp/events')
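For the SQL below to run, the loaded DataFrame also has to be registered as a table; a minimal sketch, using 'events' as the table name:

df = sqlContext.parquetFile('/var/tmp/events')
df.registerTempTable('events')
df.printSchema()  # the discovered partition columns should show up here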
The documentation says:
'Spark SQL will automatically extract the partitioning information from the paths'
However, my query
SELECT * FROM events WHERE occurredAtYear = 2015
fails miserably, with Spark saying it cannot resolve 'occurredAtYear'.
I can see the schema for all the other attributes of the event and can run queries on those, but printSchema does not list occurredAtYear/Month/Day in the schema at all. What am I missing to get partitioning working properly?
Cheers
Upvotes: 1
Views: 4287
Reputation: 2017
So it turns out I was following the instructions too precisely; I was actually writing the parquet files out to
/var/tmp/occurredAtYear=2015/occurredAtMonth=07/occurredAtDay=16/data.parquet
The 'data.parquet' suffix was creating a further directory level with the parquet files underneath it. I should have been saving the parquet files to
/var/tmp/occurredAtYear=2015/occurredAtMonth=07/occurredAtDay=16
All works now with the schema being discovered correctly.
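In other words, the fix was to treat the day directory itself as the save target (a sketch, reusing the hypothetical day_df from above):

# Wrong: 'data.parquet' becomes an extra directory level that does not
# match the key=value pattern, so partition discovery breaks on it
day_df.saveAsParquetFile(
    '/var/tmp/occurredAtYear=2015/occurredAtMonth=07/occurredAtDay=16/data.parquet')

# Right: save into the day directory itself, so the parquet part files
# land directly underneath it
day_df.saveAsParquetFile(
    '/var/tmp/occurredAtYear=2015/occurredAtMonth=07/occurredAtDay=16')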
Upvotes: 8