Reputation: 11
Problem:
I want to query a table stored in Hive through the Spark SQL JDBC interface and fetch more than 1,000,000 rows, but I hit an OOM error.
sql = "select * from TEMP_ADMIN_150601_000001 limit XXX ";
My environment:
- 5 nodes (1 master + 4 workers), 1000M network switch, Red Hat 6.5
- Each node: 8 GB RAM, 500 GB hard disk
- Java 1.6, Scala 2.10.4, Hadoop 2.6, Spark 1.3.0, Hive 0.13
Data:
A table of users and their electricity charges:
- About 1,600,000 rows, about 28 MB total; each row occupies about 18 bytes
- 2 columns: user_id String, total_num Double
Repro steps:
1. Start Spark.
2. Start the Spark SQL thrift server with:
/usr/local/spark/spark-1.3.0/sbin/start-thriftserver.sh \
--master spark://cx-spark-001:7077 \
--conf spark.executor.memory=4g \
--conf spark.driver.memory=2g \
--conf spark.shuffle.consolidateFiles=true \
--conf spark.shuffle.manager=sort \
--conf "spark.executor.extraJavaOptions=-XX:-UseGCOverheadLimit" \
--conf spark.file.transferTo=false \
--conf spark.akka.timeout=2000 \
--conf spark.storage.memoryFraction=0.4 \
--conf spark.cores.max=8 \
--conf spark.kryoserializer.buffer.mb=256 \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.akka.frameSize=512 \
--driver-class-path /usr/local/hive/lib/classes12.jar
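The query itself goes through the JDBC interface; a reproduction with beeline (the JDBC client bundled with Spark) would look something like the following. The host and port are assumptions based on the start command above (10000 is the thrift server's default listen port).

```shell
# Hypothetical client step: run the failing query over JDBC via beeline.
# Host/port assumed from the start command and thrift server defaults.
/usr/local/spark/spark-1.3.0/bin/beeline \
  -u jdbc:hive2://cx-spark-001:10000/default \
  -e "select * from TEMP_ADMIN_150601_000001 limit 1300000;"
```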
Preliminary diagnosis:
6. Fetching fewer than 1,000,000 rows always succeeds.
7. Fetching more than 1,300,000 rows always fails with OOM: GC overhead limit exceeded.
8. Fetching about 1,040,000-1,200,000 rows usually succeeds if the query runs right after the thrift server starts; if I run the same query again after one success, it fails.
9. There are 3 failure patterns: OOM: GC overhead limit exceeded, OOM: Java heap space, or lost worker heartbeat after 120 s.
10. I tried starting the thrift server with different configurations, giving each worker 4 GB or 2 GB of memory, and got the same behavior: regardless of total worker memory, I can fetch fewer than 1,000,000 rows but never more than 1,300,000.
Preliminary conclusions:
11. The total data is less than 30 MB, which is very small, and there is no complex computation involved, so the failure is not caused by genuinely excessive memory requirements. I therefore suspect a defect in the Spark SQL code.
12. Allocating 2 GB or 4 GB of memory to each worker gives the same behavior, which strengthens that suspicion, but I can't find the specific location.
Upvotes: 1
Views: 992
Reputation: 1652
What you could additionally try is setting spark.sql.thriftServer.incrementalCollect to true. The effects are described nicely in https://issues.apache.org/jira/browse/SPARK-25224.
Upvotes: 0
Reputation: 5817
Spark workers send all task results to the driver program (the thrift server), and the driver collects them into an org.apache.spark.sql.Row[TASK_COUNT][ROW_COUNT] array.
This is the root cause of the thrift server OOM.
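Given that explanation, the buffer lives in the driver (thrift server) JVM, which matches the observation in step 10 that changing executor memory has no effect. If incremental collection is not available, the only remaining lever is the driver's own heap, e.g. raising it at startup (the 6g value below is purely illustrative and must fit within the node's 8 GB RAM):

```shell
# Illustrative workaround: give the driver (thrift server) a larger heap
# than the original 2g, since the result array is buffered there
--conf spark.driver.memory=6g
```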
Upvotes: 1