xv70

Reputation: 982

How do Spark RDDs and DataFrames differ in how they load data into memory?

RDDs are useful because they let users process data at the "row" level (or a single JSON object, etc.) without having to load all the data into memory. The driver figures out how to distribute the data (or pointers to it) to the workers, and each partition happily executes the code per line / row / object. Then, without having to collect the data in the driver, I can save the result of each partition to a separate text file.

How does this work with DataFrames? I suspect it is not the same, because I can process a month's worth of server logs just fine on a small 8-node cluster using RDDs, but as soon as I try to even load the distributed data into a DataFrame with sql_context(spark_context).sql.read.json(s3path), it throws all sorts of out-of-memory errors and the job aborts. The data set is exactly the same as the one the RDD job handles properly: same cluster, same time period.

Is there a difference in the way RDDs and DataFrames handle loading data into memory that might explain my results? Please help me understand the differences between RDDs and DataFrames that might be driving these results. Thanks.

Upvotes: 3

Views: 1648

Answers (1)

Thiago Baldim

Reputation: 7732

This is a point worth understanding, and I ran into the same problem a few weeks ago. The function you are using to load the data:

sql_context(spark_context).sql.read.json(s3path)

According to the documentation, if you don't supply a schema, Spark goes through the JSON data to work out the field types before it can build the DataFrame. This works like the inferSchema option when loading CSV with the Databricks library.
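To make the cost of inference concrete, here is a toy sketch in plain Python. It is not Spark's implementation, only an illustration of the idea: without a schema, every record has to be read and parsed before the type of any field is known. The `infer_schema` helper and the sample records are made up for this example.

```python
import json


def infer_schema(json_lines):
    """Toy inference: one full pass over every record to collect field types."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            # Remember every Python type name seen for this field.
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


# Hypothetical sample data; real logs would be far larger.
sample_lines = [
    '{"host": "a", "status": 200}',
    '{"host": "b", "status": "timeout", "bytes": 512}',
]
inferred = infer_schema(sample_lines)
```

Note that the second record introduces a new field and a conflicting type for `status`, which is why a pass over the data is needed before the schema can be settled.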

So what I recommend is:

  • Supply the schema of the JSON yourself, using sql.types
  • Or, since I know writing the schema out is a lot of overhead, keep loading the data with your RDD approach and then call toDF() on the result

Well, this is probably the problem you are facing. I didn't hit the OOM issue myself, but it was taking minutes to load something that loads really fast as an RDD.

Upvotes: 6
