Reputation: 21
So, I have around 35 GB of zip files, each one contains 15 csv files, I have created a scala script that processes each one of the zip files and each one of the csv files per each zip file.
The problem is that after some amount of files the script lunches this error
ERROR Executor: Exception in task 0.0 in stage 114.0 (TID 3145) java.io.IOException: java.sql.BatchUpdateException: (Server=localhost/127.0.0.1[1528] Thread=pool-3-thread-63) XCL54.T : [0] insert of keys [7243901, 7243902,
And the string continues with all the keys (records) that were not inserted.
So what I have found is that apparently (I said apparently because of my lack of knowledge about scala and snappy and spark) the memory that is been used is full... my question... how do I increment the size of the memory used? or how do I empty the data that is in memory and save it in the disk?
Can I close the session started and that way free the memory? I have had to restart the server, remove the files processed and then I can continue with the importation but after some other files... again... same exception
My csv files are big... the biggest one is around 1 GB but this exception happens not just with the big files but when accumulating multiple files... until some size is reached... so where do I change that memory use size?
I have 12GB RAM...
Upvotes: 1
Views: 243
Reputation: 165
BatchUpdateException tells me that you are creating Snappy tables and inserting data in them. Also, BatchUpdateException in most of the cases means low memory (exception message needs to be better). So, I believe you may be right about the memory. For freeing the memory, you will have to drop the tables that you created. For information about memory size and table sizing, you may want to read these docs:
Also if you have lot of data that can't fit in memory, you can overflow it to disk. See the following doc about the overflow configuration:
http://snappydatainc.github.io/snappydata/best_practices/design_schema/#overflow-configuration
Hope it helps.
Upvotes: 1
Reputation: 535
I think you are running out of available memory. The exception message is misleading. If you only have 12GB of memory on your machine, I wonder if your data would fit. What I would do is first figure out how memory you need.
1. Copy conf/servers.template to conf/servers file
2) Change this file with something like this: localhost -heap-size=3g
-memory-size=6g //this essentially allocates 3g in your server for computations (spark, etc) and allocates 6g of off-heap memory for your data (column tables only).
3) start your cluster using snappy-start-all.sh
4) Load some subset of your data (I doubt you have enough memory)
5) Check the memory used in the SnappyData Pulse UI (localhost:5050)
if you think you have enough memory, load the full data.
Hopefully that works out.
Upvotes: 1
Reputation: 4045
You can use RDD persistance and store to disk/memory or a combination : https://spark.apache.org/docs/2.1.0/programming-guide.html#rdd-persistence
Also, try adding a large number of partitions when reading the file(s): sc.textFile(path, 200000)
Upvotes: 1