Does it create the whole DataFrame in memory?
How do I create a large DataFrame (> 1 million rows) and persist it for later queries?
To persist it for later queries:
import org.apache.spark.SparkContext
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.storage.StorageLevel
val sc: SparkContext = ...
val hc = new HiveContext(sc)
val df: DataFrame = myCreateDataFrameCode()
  .coalesce(8)
  .persist(StorageLevel.MEMORY_ONLY_SER)
df.show()  // an action, which triggers evaluation and caches the computed partitions
This will coalesce the DataFrame to 8 partitions before persisting it with serialization. I'm not sure what number of partitions is best; perhaps even "1". Check the StorageLevel docs for other persistence options, such as MEMORY_AND_DISK_SER, which keeps serialized data in memory and spills partitions that don't fit to disk instead of recomputing them.
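As a minimal sketch of that alternative (same hypothetical myCreateDataFrameCode() as above), only the storage level changes:

// Serialized in memory; partitions that don't fit are spilled to disk
val dfSpill: DataFrame = myCreateDataFrameCode()
  .coalesce(8)
  .persist(StorageLevel.MEMORY_AND_DISK_SER)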
In answer to the first question: yes, I think Spark will need to create the whole DataFrame in memory before persisting it. If you're getting an OutOfMemoryError, that's probably the key roadblock. You don't say how you're creating the DataFrame. Perhaps there's a workaround, such as creating and persisting it in smaller pieces with MEMORY_AND_DISK_SER, and then combining the pieces.
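A rough sketch of that piecewise idea, assuming a hypothetical createPiece(i) that produces one slice of the data (unionAll is the Spark 1.x DataFrame API):

// Hypothetical: build and persist the data in slices, then combine them
def createPiece(i: Int): DataFrame = ???  // stand-in for your slice-building logic

val pieces: Seq[DataFrame] = (0 until 10).map { i =>
  val piece = createPiece(i).persist(StorageLevel.MEMORY_AND_DISK_SER)
  piece.count()  // an action, so the slice is computed and cached now
  piece
}

// unionAll is lazy; the combined DataFrame reads from the cached slices
val combined: DataFrame = pieces.reduce(_ unionAll _)

Each slice is materialized on its own, so no single step has to hold the full dataset in memory at once.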