Harit Vishwakarma

Reputation: 2441

Spark SQL createDataFrame() raising OutOfMemory exception

Does it create the whole DataFrame in memory?

How do I create a large DataFrame (more than 1 million rows) and persist it for later queries?

Upvotes: 1

Views: 868

Answers (1)

rake

Reputation: 2398

To persist it for later queries:

import org.apache.spark.SparkContext
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.storage.StorageLevel

val sc: SparkContext = ...
val hc = new HiveContext( sc )
// Coalesce to 8 partitions, then cache the serialized rows in memory.
val df: DataFrame = myCreateDataFrameCode().
          coalesce( 8 ).persist( StorageLevel.MEMORY_ONLY_SER )
df.show()

This coalesces the DataFrame to 8 partitions before persisting it with serialization. I can't say which number of partitions is best; it might even be 1. Check the StorageLevel docs for other persistence options, such as MEMORY_AND_DISK_SER, which keeps serialized partitions in memory and spills the ones that don't fit to disk.
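For example, the same call with MEMORY_AND_DISK_SER would look like this (still using the hypothetical myCreateDataFrameCode() from above):

// Serialized in memory; partitions that don't fit spill to disk.
val df: DataFrame = myCreateDataFrameCode().
          coalesce( 8 ).persist( StorageLevel.MEMORY_AND_DISK_SER )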

In answer to the first question: yes, I think Spark will need to create the whole DataFrame in memory before persisting it. If you're getting 'OutOfMemory', that's probably the key roadblock. You don't say how you're creating the DataFrame. Perhaps there's a workaround, such as creating and persisting it in smaller pieces, persisting each piece with MEMORY_AND_DISK_SER, and then combining the pieces.
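A rough sketch of that piecewise idea, assuming your data can be split into independent chunks; createChunk(i) and the chunk count of 10 are placeholders for however you produce one slice of your data:

// Build and persist the DataFrame in smaller pieces, then combine them.
val pieces: Seq[DataFrame] = (0 until 10).map { i =>
  createChunk( i ).persist( StorageLevel.MEMORY_AND_DISK_SER )
}
// Combine the persisted pieces (Spark 1.x DataFrame API uses unionAll).
val df: DataFrame = pieces.reduce( _ unionAll _ )
df.persist( StorageLevel.MEMORY_AND_DISK_SER )
df.show()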

Upvotes: 1
