Reputation: 6933
I have a single test node with 8 GB RAM on which I am loading barely 10 MB of data (from CSV files) into Cassandra (running on the same node). I'm trying to process this data using Spark (also running on the same node).
Please note that I'm allocating 1 GB of RAM for SPARK_MEM and the same for SPARK_WORKER_MEMORY. Allocating any extra memory results in Spark throwing a "Check if all workers are registered and have sufficient memory" error, which is more often than not an indication that Spark is looking for more memory (as per the SPARK_MEM and SPARK_WORKER_MEMORY settings) and coming up short.
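For reference, in my standalone setup those two settings are declared in conf/spark-env.sh roughly as follows (the file location assumes the standalone deployment; the values are the 1 GB mentioned above):
export SPARK_MEM=1g
export SPARK_WORKER_MEMORY=1g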
When I try to load and process all the data in the Cassandra table through the Spark context object, I get an error during processing. So I'm trying to use a looping mechanism to read chunks of data at a time from one table, process them, and put them into another table.
My source code has the following structure:
var data = sc.cassandraTable("keyspacename", "tablename").where("value=?", 1)
data.map(x => transformFunction(x)).saveToCassandra("keyspacename", "tablename")

for (i <- 2 to 50000) {
  data = sc.cassandraTable("keyspacename", "tablename").where("value=?", i)
  data.map(x => transformFunction(x)).saveToCassandra("keyspacename", "tablename")
}
Now, this works for a while, for around 200 iterations, and then it throws an error: java.lang.OutOfMemoryError: unable to create new native thread.
I've got two questions:
Is this the right way to deal with data?
How can processing just 10 MB of data do this to a cluster?
Upvotes: 2
Views: 1796
Reputation: 359
You are running a query inside the for loop. If the 'value' column is not a key or indexed column, the where clause cannot be pushed down to Cassandra, so Spark loads the table into memory and then filters on the value, once per iteration. This will certainly cause an OOM.
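If all you need is to transform every row, a single pass over the table avoids both the 50,000 server-side queries and the repeated in-memory filtering. Below is a minimal sketch, assuming the DataStax spark-cassandra-connector and the keyspace/table names from the question; transformFunction is the function from the question, and "targettablename" is only a placeholder for the destination table:

import com.datastax.spark.connector._  // connector implicits for cassandraTable / saveToCassandra

// Read the source table once; the connector splits it into Spark partitions,
// so there is no per-value query and no repeated full-table scan.
val rows = sc.cassandraTable("keyspacename", "tablename")

// Transform every row in one pass and write the results out.
rows.map(x => transformFunction(x)).saveToCassandra("keyspacename", "targettablename")

If you do need to query by value, making it part of the primary key (or adding a secondary index) lets the where clause be served by Cassandra instead of being filtered inside Spark.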
Upvotes: 1