Reputation: 20485
OK, so after getting exceptions about not being able to write keys into a Parquet file via Spark, I looked into the API and found only this:
public class ParquetOutputFormat<T> extends FileOutputFormat<Void, T> {....
(My assumption could be wrong =D, and there might be another API somewhere.)
OK, this makes some warped sense; after all, you can project/restrict the data as it is materialising out of the container file. However, just to be on the safe side: a Parquet file does not have the notion of a sequence file's "key" value, right?
I find this a bit odd: the Hadoop infrastructure builds around the fact that a sequence file may have a key, and I assume this key is used liberally to partition data into blocks for locality (not at the HDFS level, of course)? Spark has a lot of API calls that work on key/value pairs to do reductions, joins, etc. Now I have to do an extra step to map the keys out of the body of the materialised object (as sketched below). Weird.
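For what it's worth, this is roughly the extra step I mean; a minimal sketch assuming Spark 2.x with a SparkSession (the file path and the userId column are made-up examples):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("parquet-keys").getOrCreate()

    // Rows come back from Parquet with no key attached, unlike a SequenceFile.
    val df = spark.read.parquet("hdfs:///data/users.parquet")

    // The extra step: pull a column out of each row to act as the key,
    // so the pair-RDD operations (reduceByKey, join, ...) become available.
    val keyed = df.rdd.keyBy(row => row.getAs[Long]("userId"))

    // Now the usual key-based operations work.
    val counts = keyed.countByKey()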
So, are there any good reasons why a key is not a first-class citizen in the Parquet world?
Upvotes: 4
Views: 2469
Reputation: 117
You are correct. A Parquet file is not a key/value file format; it's a columnar format. Your "key" can be a specific column from your table (see the sketch below), but it's not like HBase, where you have a real key concept. Parquet is not a sequence file.
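A rough sketch of what that looks like in Spark, assuming Spark 2.x; the Event case class, the eventId column, and the paths are illustrative, not from your setup:

    import org.apache.spark.sql.SparkSession

    // The would-be key is just another column in the schema.
    case class Event(eventId: Long, payload: String)

    val spark = SparkSession.builder().appName("parquet-write").getOrCreate()
    import spark.implicits._

    // Write: there is no separate key slot, so eventId travels as a column.
    val events = Seq(Event(1L, "a"), Event(2L, "b")).toDS()
    events.write.parquet("hdfs:///tmp/events.parquet")

    // Read: recover the "key" by selecting that column back out.
    val byId = spark.read.parquet("hdfs:///tmp/events.parquet")
      .rdd
      .keyBy(_.getAs[Long]("eventId"))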
Upvotes: 4