Reputation: 12960
I am trying to convert a Spark DataFrame to a pandas DataFrame. I am trying this in a Jupyter notebook on EMR, and I am getting the error shown below.
The pandas library is installed on the master node under my user, and using the Spark shell (pyspark) I am able to convert the df to a pandas df on that master node.
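Roughly what I am running (an illustrative sketch, not the exact notebook code):
df = spark.createDataFrame([("a",), ("b",)], ["q_data"])  # `spark` is the notebook's SparkSession
pdf = df.toPandas()  # works from pyspark on the master node, fails in the notebook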
The following command has been executed on all the master nodes:
pip --no-cache-dir install pandas --user
The following works on the master node, but not from the PySpark notebook:
import Pandas as pd
Error:
Traceback (most recent call last):
ModuleNotFoundError: No module named 'Pandas'
Update:
I can run the following code from a Python notebook:
import pandas as pd
pd.DataFrame(["a", "b"], columns=['q_data'])
Upvotes: 1
Views: 1068
Reputation: 41
We also kept getting the following error when we ran an EMR 5.33.0 step that creates and manipulates DataFrames.
File "/mnt/tmp/spark-49de09b2-5f77-4c46-a562-eed3742852be/test.py", line 131, in <module>
stores = df.toPandas()['storename'].unique().tolist()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 2086, in toPandas
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 129, in require_minimum_pandas_version
ImportError: Pandas >= 0.19.2 must be installed; however, it was not found.
This is a misleading error: it is actually caused by a version mismatch between the numpy and pandas packages. Our AWS support was able to find this one.
EMR runs its own bootstrap after the custom bootstrap actions that you specify, and that bootstrap installs a set of libraries. The numpy versions it installs lead to conflicts. For example, when we install pandas==1.3.0 in our bootstrap script, it pulls in numpy==1.21.2. But then, as part of the EMR bootstrap (also called application provisioning), EMR installs numpy==1.16.5. Because of this, there is a mismatch between the numpy version that pip3 reports and the one that python/pyspark actually imports.
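A quick way to confirm this kind of mismatch (a diagnostic sketch, not part of the fix AWS gave us) is to compare the version the notebook's interpreter actually loads against what pip3 reports on the master node:
# Run in the pyspark notebook, then compare with `pip3 show numpy pandas` on the master node.
import numpy
print("numpy loaded by this interpreter:", numpy.__version__)
print("loaded from:", numpy.__file__)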
To fix it:
Step 1: Create a secondary bootstrap action script and upload it to S3
#!/bin/bash
# Keep polling the node-provisioning state; once EMR's own provisioning has
# finished (state SUCCESSFUL), run the post-provision install and stop.
while true; do
NODEPROVISIONSTATE=` sed -n '/localInstance [{]/,/[}]/{
/nodeProvisionCheckinRecord [{]/,/[}]/ {
/status: / { p }
/[}]/a
}
/[}]/a
}' /emr/instance-controller/lib/info/job-flow-state.txt | awk ' { print $2 }'`
if [ "$NODEPROVISIONSTATE" == "SUCCESSFUL" ]; then
sleep 10;
echo "Running my post provision bootstrap"
sudo pip3 install pandas==1.3.0
exit 0  # stop polling once the install has run, otherwise the loop reinstalls forever
fi
sleep 10;
done
Step 2: Modify your existing bootstrap script
#!/bin/bash -x
# Fetch the secondary bootstrap from S3 and launch it in the background,
# so this primary bootstrap action can return immediately.
aws s3 cp s3://<BUCKET>/secondary-bootstrap.sh /home/hadoop/secondary-bootstrap.sh && sudo bash /home/hadoop/secondary-bootstrap.sh &
exit 0
Step 3: Relaunch your EMR cluster
Upvotes: 2
Reputation: 5536
You need pandas on the driver node: when converting to a pandas df, all the data is collected to the driver and then converted there.
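As a minimal sketch of why (assuming an active SparkSession named spark):
# toPandas() pulls every row back to the driver process, so pandas must be
# importable in the driver's Python environment, not just on the executors.
df = spark.range(3)
pdf = df.toPandas()  # raises ImportError if the driver's Python lacks pandas
print(type(pdf))     # <class 'pandas.core.frame.DataFrame'>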
Upvotes: 1