Newbie

Reputation: 741

Efficient way to convert a DataFrame to an RDD in Scala/Spark?

I have a DataFrame = [CUSTOMER_ID, itemType, eventTimeStamp, valueType] which I convert to RDD[(String, (String, String, Map[String, Int]))] by doing the following:

    val tempFile = result.map { r =>
      // pull each column out of the Row by name
      val customerId     = r.getAs[String]("CUSTOMER_ID")
      val itemType       = r.getAs[String]("itemType")
      val eventTimeStamp = r.getAs[String]("eventTimeStamp")
      val valueType      = r.getAs[Map[String, Int]]("valueType")
      (customerId, (itemType, eventTimeStamp, valueType))
    }

Since my input is huge, this takes a long time. Is there a more efficient way to convert the DataFrame to RDD[(String, (String, String, Map[String, Int]))]?

Upvotes: 1

Views: 1727

Answers (1)

Tim

Reputation: 3725

The operation you've described is as cheap as it's going to get: a few getAs calls and a couple of tuple allocations per row are almost free. If it's slow, that's probably unavoidable given your data size (7T). Also note that Catalyst optimizations cannot be applied to RDDs, so putting this kind of .map downstream of DataFrame operations will often prevent Spark from taking other shortcuts.
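
If your downstream steps could stay in the Dataset API, one option is to read the rows into a case class and only drop to an RDD at the very last moment, so Catalyst keeps optimizing everything before that point. A minimal sketch, assuming Spark 2.x with a SparkSession in scope as spark (the Event case class name is just illustrative, not something from your code):

    import spark.implicits._   // brings in encoders for case classes

    // Mirror the DataFrame's columns in a case class so .as[Event] can use an encoder
    case class Event(CUSTOMER_ID: String,
                     itemType: String,
                     eventTimeStamp: String,
                     valueType: Map[String, Int])

    val typed = result.as[Event]   // still a Dataset, so Catalyst/Tungsten still apply

    // Convert to an RDD of pairs only where an RDD API is actually required
    val pairRdd = typed.rdd.map(e => (e.CUSTOMER_ID, (e.itemType, e.eventTimeStamp, e.valueType)))

The per-row cost is roughly the same either way; the point is just to delay the .rdd switch so your earlier DataFrame stages keep their optimizations.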

Upvotes: 2
