shapiy

Reputation: 1155

How does the DataFrame API depend on RDDs in Spark?

Some sources, like the Keynote: Spark 2.0 talk by Matei Zaharia, mention that Spark DataFrames are built on top of RDDs. I have found some mentions of RDDs in the DataFrame class (in Spark 2.0, I'd have to look at Dataset), but I still have only a very limited understanding of how these two APIs are bound together behind the scenes.

Can someone explain how DataFrames extend RDDs, if they do?

Upvotes: 2

Views: 738

Answers (1)

shapiy

Reputation: 1155

According to the Databricks article Deep Dive into Spark SQL's Catalyst Optimizer (see the section Using Catalyst in Spark SQL), RDDs are elements of the Physical Plan built by Catalyst. So we describe queries in terms of DataFrames, but in the end, Spark operates on RDDs.

(Image: Catalyst workflow)
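You can also see the binding directly from a DataFrame. Here is a minimal sketch (assuming Spark 2.0+, where SparkSession is available; the two-column auction data below is a made-up stand-in for the dataset used in this answer): DataFrame.rdd gives you the data back as an ordinary RDD[Row], and queryExecution.toRdd, a developer-level API whose shape may vary between versions, exposes the internal RDD that the physical plan produces.

import org.apache.spark.sql.SparkSession

// A local session (in spark-shell, `spark` already exists).
val spark = SparkSession.builder.appName("df-rdd-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the `auction` DataFrame used below.
val auction = Seq(("a1", 10.0), ("a2", 20.0)).toDF("auctionid", "bid")

// DataFrame.rdd hands the data back as an RDD[Row],
// so the classic RDD API still applies.
val rows = auction.rdd
println(rows.map(_.getString(0)).distinct().count())   // 2

// queryExecution.toRdd (a developer API) exposes the internal
// RDD[InternalRow] that the physical plan actually produces.
val internal = auction.queryExecution.toRdd
println(internal.getNumPartitions)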

Also, you can view the physical plan of your query by calling the explain() method on it.

// Prints the physical plan to the console for debugging purposes
auction.select("auctionid").distinct.explain()

// == Physical Plan ==
// Distinct false
//  Exchange (HashPartitioning [auctionid#0], 200)
//   Distinct true
//    Project [auctionid#0]
//     PhysicalRDD [auctionid#0,bid#1,bidtime#2,bidder#3,bidderrate#4,openbid#5,price#6,item#7,daystolive#8], MapPartitionsRDD[11] at mapPartitions at ExistingRDD.scala:37
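Note the PhysicalRDD leaf at the bottom of the plan: however the query was expressed through the DataFrame API, it bottoms out in a scan over an existing RDD, which is then fed through the planned RDD operations above it.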

Upvotes: 5
