Victor Rodriguez

Reputation: 85

How can I extract information from parquet files with Spark/PySpark?

I have to read in N parquet files, sort all the data by a particular column, and then write out the sorted data in N parquet files. While I'm processing this data, I also have to produce an index that will later be used to optimize the access to the data in these files. The index will also be written as a parquet file.

For the sake of example, let's say that the data represents grocery store transactions and we want to create an index by product to transaction so that we can quickly know which transactions have cottage cheese, for example, without having to scan all N parquet files.

I'm pretty sure I know how to do the first part, but I'm struggling with how to extract and tally the data for the index while reading in the N parquet files.
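For reference, this is roughly what I have in mind for the first part (the paths and the `transaction_ts` sort column are placeholders for my actual schema):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sort-transactions").getOrCreate()

# Placeholder paths and sort column; the real job reads N parquet files.
input_path = "transactions/*.parquet"
output_path = "transactions_sorted/"

df = spark.read.parquet(input_path)

# A global sort range-partitions the data and sorts within each partition,
# so the N output files come out in overall sorted order.
df.sort("transaction_ts").write.mode("overwrite").parquet(output_path)
```

It's the index-building step on top of this that I'm unsure about.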

For the moment, I'm using PySpark locally on my box, but this solution will eventually run on AWS, probably in AWS Glue.

Any suggestions on how to create the index would be greatly appreciated.

Upvotes: 0

Views: 884

Answers (1)

Matt Andruff

Reputation: 5135

This is already built into Spark SQL. In SQL use "distribute by", or in PySpark use partitionBy before writing, and it will group the data as you wish on your behalf. Even if you don't use a partitioning strategy, Parquet has predicate pushdown that does lower-level filtering. (Actually, if you are using AWS, you likely don't want to use partitioning and should stick with large files that rely on predicate pushdown, specifically because S3 scanning of directories is slow and should be avoided.)
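Here's a minimal sketch of what I mean (the column names and paths are made up; swap in your own):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

# Made-up paths and column names; adjust to your schema.
df = spark.read.parquet("transactions_sorted/")

# partitionBy writes one directory per product value, so a later read that
# filters on product only touches the matching directories.
(df.write
    .partitionBy("product")
    .mode("overwrite")
    .parquet("transactions_by_product/"))

# With or without partitioning, a filter on read benefits from parquet
# predicate pushdown: row groups whose min/max stats exclude the value are
# skipped instead of scanned.
cottage = (spark.read
    .parquet("transactions_by_product/")
    .filter("product = 'cottage cheese'"))
```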

Basically, great idea, but this is already in place.

Upvotes: 0
