Reputation: 1468
I have a simple Hive query which works fine in yarn-client mode using the pyspark shell, whereas it throws the below error when I run it in yarn-cluster mode.
Exception in thread "Thread-6"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Thread-6"
Exception in thread "Reporter"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Reporter"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "sparkDriver-scheduler-1"
Cluster information: Hadoop 2.4, Spark 1.4.0-hadoop2.4, Hive 0.13.1. The script takes 10 columns from a Hive table, applies some transformations, and writes the result to a file.
> --num-executors 200 --executor-memory 8G --driver-memory 16G --executor-cores 3
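For reference, a submit command with those settings would look roughly like this (the script name `my_script.py` is a placeholder, not from the original post):

```shell
# Sketch of the spark-submit invocation with the settings above;
# my_script.py is a placeholder for the actual PySpark script.
spark-submit \
  --master yarn-cluster \
  --num-executors 200 \
  --executor-memory 8G \
  --driver-memory 16G \
  --executor-cores 3 \
  my_script.py
```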
Full stack trace:
py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o62.javaToPython.
: java.lang.OutOfMemoryError: PermGen space
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2570)
at java.lang.Class.getDeclaredMethods(Class.java:1855)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:206)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:683)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:682)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:682)
at org.apache.spark.api.python.SerDeUtil$.javaToPython(SerDeUtil.scala:140)
at org.apache.spark.sql.DataFrame.javaToPython(DataFrame.scala:1435)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
Upvotes: 4
Views: 2030
Reputation: 484
A little addition to Mark's answer: sometimes Spark with HiveContext complains about an OutOfMemoryError without any mention of PermGen, yet only -XX:MaxPermSize helps.
So if you're dealing with an OOM when using Spark + HiveContext, also try -XX:MaxPermSize.
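For example, the flag can be passed to both the driver and the executors via Spark's extraJavaOptions settings (a sketch; the 512M value and script name are assumptions to tune for your job):

```shell
# Sketch: raise PermGen on both driver and executors (Spark 1.x on Java 7).
# The 512M size is an assumed starting point, not a recommendation;
# your_script.py is a placeholder.
spark-submit \
  --conf spark.driver.extraJavaOptions=-XX:MaxPermSize=512M \
  --conf spark.executor.extraJavaOptions=-XX:MaxPermSize=512M \
  your_script.py
```

In yarn-cluster mode the driver runs inside the ApplicationMaster container, so setting it through `spark.driver.extraJavaOptions` (or `--driver-java-options`) is what actually reaches the driver JVM.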
Upvotes: 0
Reputation: 364697
java.lang.OutOfMemoryError: PermGen space at java.lang.ClassLoader.defineClass1(...
You are likely running out of "permanent generation" heap space in the driver's JVM. This area is used to store loaded classes. In cluster mode the JVM needs to load more classes (I think this is because the ApplicationMaster runs inside the same JVM as the driver). To increase the PermGen area, add the following option:
--driver-java-options -XX:MaxPermSize=256M
See also https://plumbr.eu/outofmemoryerror/permgen-space
When using HiveContext in your Python program, I've found that the following option is also needed:
--files /usr/hdp/current/spark-client/conf/hive-site.xml
I also wanted to specify a particular version of Python to use, which requires another option:
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/local/bin/python2.7
See also https://issues.apache.org/jira/browse/SPARK-9235
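Putting the options above together, a full submit command might look like this (a sketch: `my_job.py` is a placeholder, and the hive-site.xml and Python paths may differ on your cluster):

```shell
# Sketch combining the options discussed above; my_job.py and the
# hive-site.xml / python2.7 paths are examples, not universal defaults.
spark-submit \
  --master yarn-cluster \
  --driver-java-options -XX:MaxPermSize=256M \
  --files /usr/hdp/current/spark-client/conf/hive-site.xml \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/local/bin/python2.7 \
  my_job.py
```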
Upvotes: 1