Reputation: 31
The data is stored in Parquet format. The Parquet files are partitioned on a partition key column, which holds a hash of the user id column:
userData/
  partitionKey=1/
    part-00044-cf737804-90ea-4c37-94f8-9aa016f6953a.c000.snappy.parquet
    part-00044-cf737804-90ea-4c37-94f8-9aa016f6953b.c000.snappy.parquet
  partitionKey=2/
    part-00059-cf737804-90ea-4c37-94f8-9aa016f6953a.c000.snappy.parquet
  partitionKey=3/
    part-00002-cf737804-90ea-4c37-94f8-9aa016f6953a.c000.snappy.parquet
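For context, a layout like the one above is typically produced by a write along these lines. This is only a sketch: the bucket count (16), the source path, and the variable names are assumptions, not taken from the question.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

// Sketch only: derive a hash bucket from UserId and write one
// directory per bucket. Bucket count and paths are assumptions.
val spark = SparkSession.builder.appName("write-userdata").getOrCreate()
val users = spark.read.parquet("rawUsers")  // hypothetical source path

users
  .withColumn("partitionKey", pmod(hash(col("UserId")), lit(16)))
  .write
  .partitionBy("partitionKey")  // one directory per hash bucket
  .parquet("userData")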
Given the partitioning scheme, we know that all rows for a given user land under exactly one partitionKey directory, since the key is a deterministic hash of the user id.
While reading the data in, I want all the data for one user to fall into the same Spark partition. A single Spark partition can hold more than one user, but it should hold all the rows for each of those users.
Currently, what I use is: SparkSession.read.parquet("../userData").repartition(200, col("UserId"))
(I also tried partitionBy with a custom partitioner. The sequence of operations was DataFrame -> RDD -> keyed RDD -> partitionBy -> RDD -> DataFrame; before the partitionBy there is a deserialize-to-object step that explodes the shuffle write.)
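For reference, the RDD route described in the parenthetical above looks roughly like this sketch; the key type, partition count, and paths are assumptions:

import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

// Sketch of the DataFrame -> RDD -> partitionBy -> DataFrame sequence.
val spark = SparkSession.builder.appName("rdd-route").getOrCreate()
val df = spark.read.parquet("userData")
val schema = df.schema

val keyed = df.rdd                              // deserialize-to-object step
  .keyBy(row => row.getAs[Long]("UserId"))      // assumes UserId is a long
  .partitionBy(new HashPartitioner(200))        // co-locate each user's rows
  .values

val result = spark.createDataFrame(keyed, schema)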
Is there a way to avoid the repartition and instead leverage the input folder structure to place each user's data in a single partition?
Upvotes: 3
Views: 902
Reputation: 844
SparkSession.read.parquet should automatically infer partitioning information from your file paths. You can find more information in the Spark SQL documentation on partition discovery.
If your file paths are:
userData/
  UserId=1/
    part-00044-cf737804-90ea-4c37-94f8-9aa016f6953a.c000.snappy.parquet
    part-00044-cf737804-90ea-4c37-94f8-9aa016f6953b.c000.snappy.parquet
  UserId=2/
    part-00059-cf737804-90ea-4c37-94f8-9aa016f6953a.c000.snappy.parquet
  UserId=3/
    part-00002-cf737804-90ea-4c37-94f8-9aa016f6953a.c000.snappy.parquet
When you invoke SparkSession.read.parquet("/path/to/userData"), it will partition by UserId.
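As a minimal sketch of what this answer describes (reusing the placeholder path from above), partition discovery surfaces UserId as a column on read:

import org.apache.spark.sql.SparkSession

// Sketch: Hive-style partition discovery adds UserId back as a column.
val spark = SparkSession.builder.appName("read-userdata").getOrCreate()
val df = spark.read.parquet("/path/to/userData")
df.printSchema()  // UserId appears as an inferred partition column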
Upvotes: 0