Reputation:
I went through the linked question, "What's the difference between RDD and Dataframe in Spark?"
Is it mandatory to create an RDD before doing any operation, or can we start working directly with a DataFrame? Is there any advantage of RDD over DataFrame?
Can we run pandas and NumPy functionality on Spark? For example, np.where from NumPy and df.groupby('').agg() from pandas.
Upvotes: 0
Views: 265
Reputation: 2178
For structured data you don't need to use RDDs. In Scala and Java you can use either DataFrame or Dataset; in Python you need to use DataFrame. Please see the official guide.
For unstructured data you will still need to use RDD.
DataFrames generally provide the fastest performance (as per Matei's book).
The DataFrame API (through Spark SQL) supports almost all SQL-like functions. You can also use Pandas; please see the Pandas guide.
The Koalas project enables using pandas syntax on Spark. I would prefer using this over Pandas. Here is the Koalas guide.
Upvotes: 1