Reputation: 6377
I need to partition my dataframe by column. I know that it is possible for saving in separate files. But I need to partition for further processing (I need to sort partitions in a certain order and apply udf to the ordered partitions).
My code is:
df = spark.createDataFrame([(2,), (1,), (2,), (1,), (2,)], ("name",)) \
.repartitionByRange(2, "name") \
.rdd.glom().collect()
print(df)
# [[Row(name=2), Row(name=1), Row(name=2), Row(name=1), Row(name=2)], []]
I need to get something like that:
[[(2,), (2,), (2,)], [(1,), (1,)]]
Upvotes: 1
Views: 495
Reputation: 42422
You can use repartition
instead of repartitionByRange
:
df = spark.createDataFrame([(2,), (1,), (2,), (1,), (2,)], ("name",)) \
.repartition(2, "name") \
.rdd.glom().collect()
print(df)
# [[Row(name=2), Row(name=2), Row(name=2)], [Row(name=1), Row(name=1)]]
repartitionByRange
uses sampling to estimate ranges and could result in errors as you have observed.
Upvotes: 1