Baktaawar

Reputation: 7490

Spark Java Heap error

I am not sure what is happening here and why.

I have a data frame which is loaded as both a pandas and a Spark data frame.

The data frame is sparse, meaning mostly zeros. Its dimensions are 56K x 9K, so it's not that big.

I also put the following settings in my spark/conf/spark-defaults.conf file:

spark.driver.memory              8g
spark.executor.memory            2g
spark.driver.maxResultSize       2g
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value

spark.jars.packages    com.databricks:spark-csv_2.11:1.4.0

So as you can see, I have already allocated 8GB for the driver and 2GB for the executor. I am using Spark installed locally on my MacBook Pro.
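As a sanity check, it's worth confirming that the running context actually picked these values up, since spark.driver.memory only takes effect if it is set before the driver JVM starts (which spark-defaults.conf does). A quick sketch, noting that sc is the SparkContext from the pyspark/Jupyter session and _conf is an internal attribute of the Python SparkContext in Spark 1.x:

print(sc._conf.get('spark.driver.memory'))         # should print '8g'
print(sc._conf.get('spark.driver.maxResultSize'))  # should print '2g'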

When I do

recommender_ct.show() 

to see the first few rows, this is what I get:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-7-8c71bfcdfd03> in <module>()
----> 1 recommender_ct.show()

/Users/i854319/spark/python/pyspark/sql/dataframe.pyc in show(self, n, truncate)
    255         +---+-----+
    256         """
--> 257         print(self._jdf.showString(n, truncate))
    258 
    259     def __repr__(self):

/Users/i854319/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
    811         answer = self.gateway_client.send_command(command)
    812         return_value = get_return_value(
--> 813             answer, self.gateway_client, self.target_id, self.name)
    814 
    815         for temp_arg in temp_args:

/Users/i854319/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/Users/i854319/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    306                 raise Py4JJavaError(
    307                     "An error occurred while calling {0}{1}{2}.\n".
--> 308                     format(target_id, ".", name), value)
    309             else:
    310                 raise Py4JError(

Py4JJavaError: An error occurred while calling o40.showString.
: java.lang.OutOfMemoryError: Java heap space
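From the traceback, the failure is inside showString, which has to format the requested rows across all of the ~9K columns into a single string on the driver before printing it. Selecting a handful of columns first would avoid materializing the full width; a sketch, where the first 10 columns are an arbitrary choice:

recommender_ct.select(recommender_ct.columns[:10]).show(5)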

This data frame was created by taking the cross-tab of a Spark data frame, as below:

recommender_ct = recommender_sdf.crosstab('TRANS', 'ITEM')
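Since crosstab produces one output column per distinct value of ITEM, the width of recommender_ct is driven by ITEM's cardinality, which can be checked directly with a quick sketch like this:

recommender_sdf.select('ITEM').distinct().count()   # roughly 9K distinct items here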

The original Spark data frame recommender_sdf works fine when .show() is used on it.

The same cross-tab is computed for the pandas data frame, and when I do the following it works fine:

# Creating a new pandas dataframe for cross-tab
recommender_pct = pd.crosstab(recommender_pdf['TRANS'], recommender_pdf['ITEM'])

recommender_pct.head()

This works immediately.

So this shows that the data can easily be loaded into memory and used by pandas, yet the same data frame in Spark throws the Java heap error on .show() or .head(). It also takes a long time before throwing the error.

I don't understand why this is happening. Isn't Spark supposed to be faster than pandas, and shouldn't it avoid this memory issue when the same data frame can easily be accessed and printed using pandas?

EDIT:

OK. The cross-tabbed Spark data frame looks like this when I fetch the first few rows and columns from the corresponding pandas data frame:

    TRANS   Bob Iger: How Do Companies Foster Innovation and Sustain Relevance  “I Will What I Want” - Misty Copeland   "On the Lot" with Disney Producers  "Your Brain is Good at Inclusion...Except When it's Not" with Dr. Steve Robbins (please do not use) WDW_ER-Leaders of Minors    1. EAS Project Lifecycle Process Flow   10 Life Lessons from Star Wars  10 Simple Steps to Exceptional Daily Productivity   10 Steps to Effective Listening
0   353 0   0   0   0   0   0   0   0   0
1   354 0   0   0   0   0   0   0   0   0
2   355 0   0   0   0   0   0   0   0   0
3   356 0   0   0   0   0   0   0   0   0
4   357 0   0   0   0   0   0   0   0   0

The column names are basically long text strings, and the column values are either 0 or 1.

Upvotes: 1

Views: 2889

Answers (1)

Visar

Reputation: 1

How I solved the same problem in Java: divide the queries that need to be executed into two (or more) parts. Execute the first half and save the results to HDFS (as Parquet). Then create a second SQLContext, read the first half's results back from HDFS, and execute the second half of the queries.
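In PySpark the same idea would look roughly like the sketch below; the path is a placeholder, and writing the intermediate result to Parquet before reading it back with a fresh SQLContext cuts the job in two:

from pyspark.sql import SQLContext

# First half: run it and persist the result (path is a placeholder)
recommender_ct.write.parquet('/tmp/recommender_ct.parquet')

# Second half: fresh context, read the saved result back, and continue
sqlContext2 = SQLContext(sc)
ct = sqlContext2.read.parquet('/tmp/recommender_ct.parquet')
ct.show()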

Upvotes: 0
