Reputation: 7490
I am not sure what is happening here and why.
I have a data frame that is loaded both as a pandas and as a Spark data frame.
The data frame is sparse, meaning mostly zeros. Its dimensions are 56K x 9K, so it's not that big.
I also put the following settings in my spark/conf/spark-defaults.conf file:
spark.driver.memory 8g
spark.executor.memory 2g
spark.driver.maxResultSize 2g
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value
spark.jars.packages com.databricks:spark-csv_2.11:1.4.0
As you can see, I have already allocated 8 GB for the driver and 2 GB for the executor. I am using Spark installed locally on my MacBook Pro.
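To double-check that these values were actually picked up by the running driver (spark.driver.memory only takes effect when the driver JVM is launched, so it cannot be raised from inside an already-running session), something like the following can be run in the notebook, where sc is the session's SparkContext:
# Print the memory-related settings the live driver is actually using
for key in ("spark.driver.memory",
            "spark.executor.memory",
            "spark.driver.maxResultSize"):
    print(key, sc._conf.get(key))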
When I do
recommender_ct.show()
to see the first few rows, this is what I get:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-7-8c71bfcdfd03> in <module>()
----> 1 recommender_ct.show()
/Users/i854319/spark/python/pyspark/sql/dataframe.pyc in show(self, n, truncate)
255 +---+-----+
256 """
--> 257 print(self._jdf.showString(n, truncate))
258
259 def __repr__(self):
/Users/i854319/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
811 answer = self.gateway_client.send_command(command)
812 return_value = get_return_value(
--> 813 answer, self.gateway_client, self.target_id, self.name)
814
815 for temp_arg in temp_args:
/Users/i854319/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
43 def deco(*a, **kw):
44 try:
---> 45 return f(*a, **kw)
46 except py4j.protocol.Py4JJavaError as e:
47 s = e.java_exception.toString()
/Users/i854319/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
306 raise Py4JJavaError(
307 "An error occurred while calling {0}{1}{2}.\n".
--> 308 format(target_id, ".", name), value)
309 else:
310 raise Py4JError(
Py4JJavaError: An error occurred while calling o40.showString.
: java.lang.OutOfMemoryError: Java heap space
This data frame was created by cross-tabbing a Spark data frame, as below:
recommender_ct = recommender_sdf.crosstab('TRANS', 'ITEM')
The Spark data frame recommender_sdf above works fine when .show() is called on it.
I used the same cross-tab method on the pandas data frame, and the following works perfectly:
# Creating a new pandas dataframe for cross-tab
recommender_pct = pd.crosstab(recommender_pdf['TRANS'], recommender_pdf['ITEM'])
recommender_pct.head()
This works almost instantly.
So the data clearly fits in memory and is usable from pandas, but the same data frame in Spark throws the Java heap space error when .show() or .head() is called, and it takes a long time before the error appears.
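A rough check of how big the pandas cross-tab actually is in memory (using pandas' own memory_usage; recommender_pct is the frame created above) would be something like:
# Report the shape and approximate in-memory size of the pandas cross-tab;
# deep=True also counts the contents of object-dtype columns and the index.
print(recommender_pct.shape)
print(recommender_pct.memory_usage(deep=True).sum() / 1024 ** 2, "MB")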
I don't understand why this is happening. Isn't Spark supposed to be faster than pandas, and shouldn't it avoid this memory issue when the same data frame can easily be accessed and printed using pandas?
EDIT:
OK. The cross-tabbed Spark data frame looks like this when I fetch the first few rows and columns from the corresponding pandas data frame:
TRANS Bob Iger: How Do Companies Foster Innovation and Sustain Relevance “I Will What I Want” - Misty Copeland "On the Lot" with Disney Producers "Your Brain is Good at Inclusion...Except When it's Not" with Dr. Steve Robbins (please do not use) WDW_ER-Leaders of Minors 1. EAS Project Lifecycle Process Flow 10 Life Lessons from Star Wars 10 Simple Steps to Exceptional Daily Productivity 10 Steps to Effective Listening
0 353 0 0 0 0 0 0 0 0 0
1 354 0 0 0 0 0 0 0 0 0
2 355 0 0 0 0 0 0 0 0 0
3 356 0 0 0 0 0 0 0 0 0
4 357 0 0 0 0 0 0 0 0 0
The column names are basically long text strings, and the column values are either 0 or 1.
Upvotes: 1
Views: 2889
Reputation: 1
How I solved the same problem in Java: divide the queries that need to be executed into two (or more) parts. Execute the first half and save the results to HDFS (as Parquet). Create a second SqlContext, read the results of the first half back from HDFS, and then execute the second half of the queries.
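In PySpark the same idea looks roughly like this (an illustrative sketch only; the paths, column names, and intermediate computation are placeholders, not the exact queries from my case):
# First half of the work: compute an intermediate result and materialize it
# to HDFS as Parquet instead of keeping the whole lineage in memory.
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.read.parquet("hdfs:///data/input.parquet")
intermediate = df.groupBy("TRANS").count().withColumnRenamed("count", "cnt")
intermediate.write.mode("overwrite").parquet("hdfs:///tmp/intermediate.parquet")

# Second half: a fresh SQLContext reads the materialized result back from HDFS,
# so the remaining queries run against that instead of the original plan.
sqlContext2 = SQLContext(sc)
restored = sqlContext2.read.parquet("hdfs:///tmp/intermediate.parquet")
restored.registerTempTable("intermediate")
result = sqlContext2.sql("SELECT TRANS, cnt FROM intermediate WHERE cnt > 10")
result.show()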
Upvotes: 0