Hester1234

Reputation: 11

Hitting OOM when fetching more than 1,000,000 rows in Apache Spark

Problem: I want to query a table stored in Hive through the Spark SQL JDBC interface and fetch more than 1,000,000 rows, but I hit an OOM. The query is: sql = "select * from TEMP_ADMIN_150601_000001 limit XXX";

My environment: 5 nodes = one master + 4 workers, 1000M network switch, RedHat 6.5. Each node: 8 GB RAM, 500 GB hard disk. Java 1.6, Scala 2.10.4, Hadoop 2.6, Spark 1.3.0, Hive 0.13.

Data: a table of users and their electricity charges. About 1,600,000 rows, about 28 MB in total; each row occupies about 18 bytes. Two columns: user_id String, total_num Double.

Repro steps: 1. Start Spark. 2. Start the Spark SQL thrift server with this command:

/usr/local/spark/spark-1.3.0/sbin/start-thriftserver.sh \
--master spark://cx-spark-001:7077 \
--conf spark.executor.memory=4g \
--conf spark.driver.memory=2g \
--conf spark.shuffle.consolidateFiles=true \
--conf spark.shuffle.manager=sort \
--conf "spark.executor.extraJavaOptions=-XX:-UseGCOverheadLimit" \
--conf spark.file.transferTo=false \
--conf spark.akka.timeout=2000 \
--conf spark.storage.memoryFraction=0.4 \
--conf spark.cores.max=8 \
--conf spark.kryoserializer.buffer.mb=256 \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.akka.frameSize=512 \
--driver-class-path /usr/local/hive/lib/classes12.jar
  3. Run the test code (see the attached file: testHiveJDBC.java).
  4. Get an OOM: GC overhead limit exceeded, OOM: Java heap space, or a lost worker heartbeat after 120 s (see the attached logs).
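The attached test file is not reproduced here, but the same fetch can be driven from the Beeline client that ships with Spark against the thrift server started above. The host and the default HiveServer2 port 10000 are assumptions; adjust them (and add -n/-p credentials) for your setup:

```shell
# Run the failing query through Beeline against the thrift server.
# Assumption: server listens on the default port 10000 on cx-spark-001.
/usr/local/spark/spark-1.3.0/bin/beeline \
  -u jdbc:hive2://cx-spark-001:10000 \
  -e "select * from TEMP_ADMIN_150601_000001 limit 1000000"
```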

Preliminary diagnosis:

6. Fetching fewer than 1,000,000 rows always succeeds.
7. Fetching more than 1,300,000 rows always fails with OOM: GC overhead limit exceeded.
8. Fetching about 1,040,000–1,200,000 rows usually succeeds if the query runs right after the thrift server starts up; if I query successfully once and then retry the same query, it fails.
9. There are 3 failure patterns: OOM: GC overhead limit exceeded, OOM: Java heap space, or a lost worker heartbeat after 120 s.
10. I tried starting the thrift server with different configurations, giving each worker 4 GB or 2 GB of memory, and got the same behavior. That is, regardless of the worker's total memory, I can fetch fewer than 1,000,000 rows and cannot fetch more than 1,300,000.

Preliminary conclusions:

11. The total data is less than 30 MB, which is tiny, and there is no complex computation involved, so the failure is not caused by excessive memory requirements. I therefore suspect a defect in the Spark SQL code.
12. Allocating 2 GB or 4 GB of memory to each worker gives the same behavior, which strengthens my suspicion that there is a defect in the code, but I can't find the specific location.

Upvotes: 1

Views: 992

Answers (2)

oae

Reputation: 1652

What you could additionally try is setting spark.sql.thriftServer.incrementalCollect to true. The effect is described pretty nicely in https://issues.apache.org/jira/browse/SPARK-25224!
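For reference, a sketch of what that looks like on the start-thriftserver.sh command line from the question. Note this is an assumption about applicability: the property was added in Spark releases after the 1.3.0 used here, so an upgrade would be required first:

```shell
# Add to the start-thriftserver.sh invocation from the question.
# With incremental collect, the driver pulls results partition by
# partition instead of materializing the whole result set at once.
--conf spark.sql.thriftServer.incrementalCollect=true
```

The trade-off is slower result delivery in exchange for a bounded driver-side memory footprint.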

Upvotes: 0

study

Reputation: 5817

Spark workers send all task results to the driver program (the ThriftServer), and the driver collects them into an org.apache.spark.sql.Row[TASK_COUNT][ROW_COUNT] array.

This is the root cause of the ThriftServer OOM.
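A rough back-of-envelope supports this: 28 MB of raw data balloons once every row becomes a tree of JVM objects on the driver's heap. The ~200 bytes per row used below is an assumption (object headers, a boxed Double, a String plus its backing char array, and array slots), not a measured figure:

```shell
# Estimate the driver-side heap needed to hold all rows as JVM objects.
# ASSUMPTION: ~200 bytes of object overhead per materialized Row.
ROWS=1600000
BYTES_PER_ROW=200
TOTAL_MB=$(( ROWS * BYTES_PER_ROW / 1024 / 1024 ))
echo "about ${TOTAL_MB} MB per in-memory copy of the result set"   # about 305 MB
```

With GC pressure and possibly more than one copy of the data alive at once (incoming task results plus the collected array), a few hundred MB per copy is already uncomfortable in a 2 GB driver, which is consistent with the ~1,000,000-row threshold observed in the question.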

Upvotes: 1
