Reputation: 1329
I have a table of 400,000 distinct users. I would like to split it into 4 parts, with each user located in exactly one part.
Here is my code:
val numPart = 4
val size = 1.0 / numPart
val nsizes = Array.fill(numPart)(size)
val data = userList.randomSplit(nsizes)
Then I write each data(i)
, for i from 0 to 3, into Parquet files. But when I read the output directory back, group by user id, and count the distinct parts per user, some users appear in two or more parts.
I have no idea why this happens.
Upvotes: 2
Views: 985
Reputation: 1329
I have found the solution: cache the DataFrame before you split it.
It should be:
val data = userList.cache().randomSplit(nsizes)
I still don't know exactly why. My guess: each time the randomSplit
function "fills" one of the splits, it reads records from userList
, which is re-evaluated from the Parquet file(s) and may return the rows in a different order. That is why some users are lost and some users are duplicated.
That's only my guess. If someone has a definitive answer or explanation, I will update this.
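A plain-Scala toy (not Spark code; the names and the per-position sampling model here are my assumptions) can illustrate the guess above: if each split's sampling decisions are made per row position with a fixed seed, then re-evaluating the source in a different row order changes which users each split receives, so splits computed from separate reads can overlap or miss users.

```scala
import scala.util.Random

// Toy model of a seeded, position-based sample: keep rows whose draw
// falls below the fraction. The draws depend only on row position,
// not on row content.
def sampleHalf(rows: Seq[String], seed: Long): Seq[String] = {
  val rng = new Random(seed)
  rows.filter(_ => rng.nextDouble() < 0.5)
}

val users      = Seq("u1", "u2", "u3", "u4", "u5", "u6")
val firstRead  = users          // row order on the first evaluation
val secondRead = users.reverse  // a re-evaluation may reorder rows

// The same seed keeps the same *positions* on every read...
val keptFirst  = sampleHalf(firstRead, seed = 42L)
val keptSecond = sampleHalf(secondRead, seed = 42L)
// ...so if the row order changed between reads, different *users* are
// kept each time. Caching pins down one evaluation, which is why the
// cache().randomSplit(...) fix above makes the splits consistent.
```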
Upvotes: 1
Reputation: 12991
If your goal is to split the data into different files, you can use functions.hash to compute a hash of the user id, take it mod 4 to get a number between 0 and 3, and then use partitionBy when writing the Parquet output, which creates a directory for each of the 4 values.
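A plain-Scala sketch of that assignment logic (the user ids and the partFor helper are hypothetical; in Spark itself this would be a functions.hash column taken mod 4, written with partitionBy as described above):

```scala
// Deterministic part assignment: hash the user id and take it mod 4.
// Unlike randomSplit over a re-evaluated source, the same id always
// lands in the same part, so no user can end up in two parts.
val numParts = 4

def partFor(userId: String): Int =
  math.floorMod(userId.hashCode, numParts)  // floorMod keeps it in 0..3

val userIds    = Seq("alice", "bob", "carol", "dave")  // made-up ids
val assignment = userIds.map(id => id -> partFor(id))
```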
Upvotes: 0