Reputation: 12960
I am trying to convert a Spark DataFrame to a pandas DataFrame. I am trying this in a Jupyter notebook on EMR, and I am getting the error shown below.
The pandas library is installed on the master node under my user, and using the Spark shell (pyspark) I am able to convert the df to a pandas df on that master node.
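Roughly what I am running (an illustrative sketch, not the exact notebook code):
df = spark.createDataFrame([("a",), ("b",)], ["q_data"])  # `spark` is the notebook's SparkSession
pdf = df.toPandas()  # works from pyspark on the master node, fails in the notebook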
The following command has been executed on all the master nodes:
pip --no-cache-dir install pandas --user
The following works on the master node, but not from the PySpark notebook:
import Pandas as pd
Error:
Traceback (most recent call last):
ModuleNotFoundError: No module named 'Pandas'
Update:
I can run the following code from a Python notebook:
import pandas as pd
pd.DataFrame(["a", "b"], columns=['q_data'])
Upvotes: 1
Views: 1068
Reputation: 41
We also kept getting the following error when we ran an EMR 5.33.0 step that creates and manipulates DataFrames.
File "/mnt/tmp/spark-49de09b2-5f77-4c46-a562-eed3742852be/test.py", line 131, in <module>
stores = df.toPandas()['storename'].unique().tolist()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 2086, in toPandas
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 129, in require_minimum_pandas_version
ImportError: Pandas >= 0.19.2 must be installed; however, it was not found.
This is a misleading error: it is actually caused by a version mismatch between the numpy and pandas packages. Our AWS support was able to find this one.
EMR runs its own bootstrap after the custom bootstrap actions that you specify, and that bootstrap installs a set of libraries. The numpy versions it installs lead to conflicts. For example, when we install pandas==1.3.0 in our bootstrap script, it pulls in numpy==1.21.2. But then, as part of the EMR bootstrap (also called application provisioning), EMR installs numpy==1.16.5. Because of this, there is a mismatch between the numpy version that pip3 reports and the one that python/pyspark actually imports.
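A quick way to confirm this kind of mismatch (a diagnostic sketch, not part of the fix AWS gave us) is to compare the version the notebook's interpreter actually loads against what pip3 reports on the master node:
# Run in the pyspark notebook, then compare with `pip3 show numpy pandas` on the master node.
import numpy
print("numpy loaded by this interpreter:", numpy.__version__)
print("loaded from:", numpy.__file__)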
To fix it:
Step 1: Create a secondary bootstrap action script and upload it to S3
#!/bin/bash
# Keep polling the node-provisioning state; once EMR's own provisioning has
# finished (state SUCCESSFUL), run the post-provision install and stop.
while true; do
NODEPROVISIONSTATE=` sed -n '/localInstance [{]/,/[}]/{
/nodeProvisionCheckinRecord [{]/,/[}]/ {
/status: / { p }
/[}]/a
}
/[}]/a
}' /emr/instance-controller/lib/info/job-flow-state.txt | awk ' { print $2 }'`
if [ "$NODEPROVISIONSTATE" == "SUCCESSFUL" ]; then
sleep 10;
echo "Running my post provision bootstrap"
sudo pip3 install pandas==1.3.0
exit 0  # stop polling once the install has run, otherwise the loop reinstalls forever
fi
sleep 10;
done
Step 2: Modify your existing bootstrap script
#!/bin/bash -x
# Fetch the secondary bootstrap from S3 and launch it in the background,
# so this primary bootstrap action can return immediately.
aws s3 cp s3://<BUCKET>/secondary-bootstrap.sh /home/hadoop/secondary-bootstrap.sh && sudo bash /home/hadoop/secondary-bootstrap.sh &
exit 0
Step 3: Relaunch your EMR cluster
Upvotes: 2
Reputation: 5536
You need pandas on the driver node: when converting to a pandas df, all the data is collected to the driver and then converted there.
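As a minimal sketch of why (assuming an active SparkSession named spark):
# toPandas() pulls every row back to the driver process, so pandas must be
# importable in the driver's Python environment, not just on the executors.
df = spark.range(3)
pdf = df.toPandas()  # raises ImportError if the driver's Python lacks pandas
print(type(pdf))     # <class 'pandas.core.frame.DataFrame'>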
Upvotes: 1