I have a DataFrame with columns [CUSTOMER_ID, itemType, eventTimeStamp, valueType]
which I convert to RDD[(String, (String, String, Map[String, Int]))]
by doing the following:
val tempFile = result.rdd.map { r =>
  val customerId = r.getAs[String]("CUSTOMER_ID")
  val itemType = r.getAs[String]("itemType")
  val eventTimeStamp = r.getAs[String]("eventTimeStamp")
  val valueType = r.getAs[Map[String, Int]]("valueType")
  (customerId, (itemType, eventTimeStamp, valueType))
}
Since my input is huge, this takes a long time. Is there a more efficient way to convert the df
to RDD[(String, (String, String, Map[String, Int]))]?
Upvotes: 1
Views: 1727
Reputation: 3725
The operation you've described is about as cheap as it's going to get: doing a few getAs calls and allocating a few tuples per row is almost free. If it's going slow, that's probably unavoidable due to your large data size (7T). Also note that Catalyst optimizations cannot be performed on RDDs, so including this kind of .map downstream of DataFrame operations will often prevent other Spark shortcuts.
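If you can keep as much work as possible in the DataFrame API and only drop to an RDD at the very end, everything upstream of that final step still benefits from Catalyst. Here is a minimal sketch of that approach, assuming Spark 2.x, a SparkSession named spark, and that result has exactly these four columns; the case class name CustomerEvent is made up for illustration:

import org.apache.spark.sql.SparkSession

// Hypothetical case class mirroring the four columns of `result`.
case class CustomerEvent(
  CUSTOMER_ID: String,
  itemType: String,
  eventTimeStamp: String,
  valueType: Map[String, Int]
)

val spark: SparkSession = SparkSession.builder().getOrCreate()
import spark.implicits._

// Convert to a typed Dataset first; column pruning and predicate pushdown
// on `result` can still happen before this point. Only the final .rdd
// leaves the optimized plan.
val pairRdd = result.as[CustomerEvent].rdd
  .map(e => (e.CUSTOMER_ID, (e.itemType, e.eventTimeStamp, e.valueType)))

The per-row cost is the same either way; the win is pushing any filters or projections onto result before the conversion so they still run under Catalyst.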
Upvotes: 2