Victor Rodriguez

Reputation: 85

How can I extract information from parquet files with Spark/PySpark?

I have to read in N parquet files, sort all the data by a particular column, and then write out the sorted data in N parquet files. While I'm processing this data, I also have to produce an index that will later be used to optimize the access to the data in these files. The index will also be written as a parquet file.

For the sake of example, let's say that the data represents grocery store transactions and we want to create an index by product to transaction so that we can quickly know which transactions have cottage cheese, for example, without having to scan all N parquet files.

I'm pretty sure I know how to do the first part, but I'm struggling with how to extract and tally the data for the index while reading in the N parquet files.
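For reference, this is roughly what I have in mind for the first part (the paths and the `transaction_ts` sort column are placeholders for my actual schema):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sort-transactions").getOrCreate()

# Placeholder paths and sort column; the real job reads N parquet files.
input_path = "transactions/*.parquet"
output_path = "transactions_sorted/"

df = spark.read.parquet(input_path)

# A global sort range-partitions the data and sorts within each partition,
# so the N output files come out in overall sorted order.
df.sort("transaction_ts").write.mode("overwrite").parquet(output_path)
```

It's the index-building step on top of this that I'm unsure about.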

For the moment, I'm using PySpark locally on my box, but this solution will eventually run on AWS, probably in AWS Glue.

Any suggestions on how to create the index would be greatly appreciated.

Upvotes: 0

Views: 884

Answers (1)

Matt Andruff

Reputation: 5135

This is already built into Spark SQL. In SQL use "distribute by", or in PySpark use partitionBy before writing, and it will group the data as you wish on your behalf. Even if you don't use a partitioning strategy, Parquet has predicate pushdown that does lower-level filtering. (Actually, if you are using AWS, you likely don't want to use partitioning and should stick with large files that rely on predicate pushdown, specifically because S3 scanning of directories is slow and should be avoided.)
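Here's a minimal sketch of what I mean (the column names and paths are made up; swap in your own):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

# Made-up paths and column names; adjust to your schema.
df = spark.read.parquet("transactions_sorted/")

# partitionBy writes one directory per product value, so a later read that
# filters on product only touches the matching directories.
(df.write
    .partitionBy("product")
    .mode("overwrite")
    .parquet("transactions_by_product/"))

# With or without partitioning, a filter on read benefits from parquet
# predicate pushdown: row groups whose min/max stats exclude the value are
# skipped instead of scanned.
cottage = (spark.read
    .parquet("transactions_by_product/")
    .filter("product = 'cottage cheese'"))
```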

Basically, great idea, but this is already in place.

Upvotes: 0
