Andrey

Reputation: 6377

How to partition a dataframe by column in PySpark for further processing?

I need to partition my dataframe by a column. I know this is possible when saving to separate files, but I need the partitioning for further processing: I want to sort each partition in a certain order and apply a UDF to the ordered partitions.

My code is:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(2,), (1,), (2,), (1,), (2,)], ("name",)) \
    .repartitionByRange(2, "name") \
    .rdd.glom().collect()
print(df)

# [[Row(name=2), Row(name=1), Row(name=2), Row(name=1), Row(name=2)], []]

I need to get something like that:

[[(2,), (2,), (2,)], [(1,), (1,)]]

Upvotes: 1

Views: 495

Answers (1)

mck

Reputation: 42422

You can use repartition instead of repartitionByRange:

df = spark.createDataFrame([(2,), (1,), (2,), (1,), (2,)], ("name",)) \
    .repartition(2, "name") \
    .rdd.glom().collect()

print(df)
# [[Row(name=2), Row(name=2), Row(name=2)], [Row(name=1), Row(name=1)]]

repartitionByRange determines the range boundaries by sampling the data, and on a dataset this small the sampling can put every row into one range, leaving the other partition empty, as you observed. repartition hash-partitions on the column instead, so all rows with the same name value are guaranteed to land in the same partition.
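
Since your end goal is to sort each partition and apply a UDF to the ordered rows, here is a minimal sketch of one way to continue from the repartitioned DataFrame, using sortWithinPartitions followed by mapPartitions. The extra value column and the process_partition function are hypothetical stand-ins for your real data and your UDF:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical data: a "value" column to sort on within each partition
df = spark.createDataFrame(
    [(2, "b"), (1, "a"), (2, "c"), (1, "d"), (2, "a")],
    ("name", "value"))

def process_partition(rows):
    # rows arrive already sorted by "value" within this partition;
    # replace this with your actual per-partition logic
    for row in rows:
        yield (row.name, row.value.upper())

result = df.repartition(2, "name") \
    .sortWithinPartitions("value") \
    .rdd.mapPartitions(process_partition) \
    .collect()

print(result)
# all rows sharing a name appear together, ordered by value within the group

Because sortWithinPartitions sorts each partition independently (without a full shuffle), the rows handed to process_partition are already in the order you need.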

Upvotes: 1
