Reputation: 6994
I found this strange problem while reading a Spark DataFrame. I repartitioned the DataFrame into 50k partitions. However, when I read the data back and perform a count action on the DataFrame, I find that the underlying RDD has only 2143 partitions when I am using Spark 2.0.
So I went to the path where I saved the repartitioned data and counted the files:
hfs -ls /repartitionedData/ | wc -l
50476
So it did create 50k partitions while saving the data.
However, with Spark 2.0,
val d = spark.read.parquet("repartitionedData")
d.rdd.getNumPartitions
res4: Int = 2143
But with Spark 1.5,
val d = spark.read.parquet("repartitionedData")
d.rdd.partitions.length
res4: Int = 50474
Can someone help me with this?
Upvotes: 4
Views: 1245
Reputation: 28392
It's not that you are losing data; Spark only changes the number of partitions. FileSourceStrategy packs the Parquet files into fewer read partitions and may also reorder the data in the process. This behavior changed when Spark was upgraded to version 2.0. You can find a somewhat related bug report here.
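If you want more read partitions back, you can either repartition after reading (at the cost of a shuffle) or reduce how much data Spark packs into each input partition. A minimal sketch for the Spark 2.0 shell, assuming the same "repartitionedData" path; the 50000 target and the 16 MB value are illustrative assumptions, not recommendations:

// Option 1: explicitly repartition after reading (triggers a shuffle)
val d = spark.read.parquet("repartitionedData").repartition(50000)

// Option 2: lower the maximum bytes packed into one read partition
// (the default is 128 MB), so file packing yields more partitions
spark.conf.set("spark.sql.files.maxPartitionBytes", 16 * 1024 * 1024)
val d2 = spark.read.parquet("repartitionedData")
d2.rdd.getNumPartitions

Option 2 only influences how files are split and combined at read time; it does not restore the original partition boundaries, so use Option 1 if you need an exact partition count.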
Upvotes: 6