Gia Duong Duc Minh

Reputation: 1329

Split Spark DataFrame into parts

I have a table of distinct users with 400,000 rows. I would like to split it into 4 parts, and I expect each user to end up in exactly one part.

Here is my code:

val numPart = 4
val size = 1.0 / numPart
val nsizes = Array.fill(numPart)(size)   // four equal weights of 0.25
val data = userList.randomSplit(nsizes)  // Array of 4 DataFrames

Then I write each data(i), for i from 0 to 3, to Parquet files. When I read the output directory back, group by user id, and count the parts per user, some users appear in two or more parts.
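Roughly, the overlap check looks like this (the output paths and the user_id column name are placeholders, and spark is the usual SparkSession):

import org.apache.spark.sql.functions.{col, countDistinct, lit}

// read every part back, tag it with its index,
// and count how many distinct parts each user id appears in
val parts = (0 until numPart).map { i =>
  spark.read.parquet(s"/path/to/output/part_$i").withColumn("part", lit(i))
}
val all = parts.reduce(_ union _)

all.groupBy("user_id")
  .agg(countDistinct("part").as("num_parts"))
  .filter(col("num_parts") > 1)
  .show()  // any rows here are users that landed in more than one part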

I still have no idea why this happens.

Upvotes: 2

Views: 985

Answers (2)

Gia Duong Duc Minh

Reputation: 1329

I have found the solution: cache the DataFrame before you split it.

It should be:

val data = userList.cache().randomSplit(nsizes)

I still don't know exactly why. My guess is that each time randomSplit "fills" one of the splits, it reads the records from userList again, which is re-evaluated from the Parquet file(s) and may return the rows in a different order; that is why some users are lost and others are duplicated.

That is only my guess. If someone has a definitive answer or explanation, I will update this.
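For completeness, a minimal sketch of the whole pattern with the fix applied (the output paths are placeholders):

val numPart = 4
val weights = Array.fill(numPart)(1.0 / numPart)

// cache() pins the rows so every split samples the same materialized data,
// instead of re-reading the Parquet source (possibly in a different order) per split
val splits = userList.cache().randomSplit(weights)

splits.zipWithIndex.foreach { case (part, i) =>
  part.write.parquet(s"/path/to/output/part_$i")
}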

References:

  1. (Why) do we need to call cache or persist on a RDD
  2. https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-caching.html
  3. http://159.203.217.164/using-sparks-cache-for-correctness-not-just-performance/

Upvotes: 1

Assaf Mendelson

Reputation: 12991

If your goal is to split the data into different files, you can use functions.hash to compute a hash of the user id, take it mod 4 to get a number between 0 and 3, and then use partitionBy on that column when writing the Parquet output, which creates a separate directory for each of the 4 values.
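A minimal sketch of that idea (the user_id column name and output path are assumptions):

import org.apache.spark.sql.functions.{col, hash, lit, pmod}

// pmod keeps the result non-negative even when hash() returns a negative int,
// so every user gets a deterministic part number in 0..3
val withPart = userList.withColumn("part", pmod(hash(col("user_id")), lit(4)))

// partitionBy writes one sub-directory per value: part=0, part=1, part=2, part=3
withPart.write.partitionBy("part").parquet("/path/to/output")

Because the part number is derived from the user id rather than from random sampling, the same user always lands in the same part, with no caching required.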

Upvotes: 0
