rmahesh

Reputation: 749

Error with Pandas command on Spark?

I would like to preface by saying I am very new to Spark. I have a working program in Pandas that I need to run on Spark, and I am using Databricks to do this. After initializing 'sqlContext' and 'sc', I load a CSV file and create a Spark dataframe. I then convert this dataframe into a Pandas dataframe, where I have already written the code to do what I need to do.

Objective: I need to load a CSV file, identify the data types, and return the data type of every column. The tricky part is that dates come in a variety of formats, for which I have written (with help from this community) regular expressions to match; I do this for every data type. At the end, I convert the columns to the correct types and print each column's type.
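
For context, this is roughly the shape of check being described — a minimal sketch only; the toy data, the pattern definitions, and the list names are assumptions chosen to line up with the snippet further down, not the asker's actual code:

 import re
 import pandas as pd

 # Toy data standing in for the asker's CSV (hypothetical)
 df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['2017-07-11', '07/11/2017', 'n/a']})
 lst, col = list(df.columns), 0

 # Hypothetical patterns; the real regexes (especially for dates) are more thorough
 int_pattern = re.compile(r'^-?\d+$')
 date_pattern = re.compile(r'^\d{1,4}[-/]\d{1,2}[-/]\d{1,4}$')

 values = df[lst[col]].astype(str)
 str_count = [v for v in values]                           # all values, stringified
 int_count = [v for v in values if int_pattern.match(v)]   # values that look like ints
 date_count = [v for v in values if date_pattern.match(v)] # values that look like dates
 # the column is treated as integer when every value matched the pattern:
 print(len(int_count) == len(str_count))                   # True for column 'a'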

After successfully loading my Pandas dataframe, I get this error: "TypeError: to_numeric() got an unexpected keyword argument 'downcast'"

The code that triggers it:

 # Changing the column data types
 if len(int_count) == len(str_count):        # every value matched the integer pattern
     df[lst[col]] = pd.to_numeric(df[lst[col]], errors='coerce', downcast='integer')
 if len(float_count) == len(str_count):      # every value matched the float pattern
     df[lst[col]] = pd.to_numeric(df[lst[col]], errors='coerce', downcast='float')
 if len(boolean_count) == len(str_count):    # every value matched the boolean pattern
     df[lst[col]] = df[lst[col]].astype('bool')
 if len(date_count) == len(str_count):       # every value matched a date pattern
     df[lst[col]] = pd.to_datetime(df[lst[col]], errors='coerce')

'lst' is the list of column headers and 'col' is the variable I use to iterate through them. This code worked perfectly when run in PyCharm; I'm not sure why I am getting this error on Spark.
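
One thing worth checking here (my assumption, not something confirmed in the question): the downcast keyword was only added to pd.to_numeric in pandas 0.19.0, so an older pandas on the Databricks cluster would raise exactly this TypeError even though the same code runs fine locally. A minimal sketch of the check, and a fallback that avoids the keyword:

 import pandas as pd
 print(pd.__version__)  # 'downcast' needs pandas >= 0.19.0

 # Fallback for older pandas: convert without downcasting,
 # then narrow the dtype separately if needed
 s = pd.to_numeric(pd.Series(['1', '2', 'x']), errors='coerce')
 print(s.dtype)  # float64 ('x' coerced to NaN)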

Any help would be great!

Upvotes: 0

Views: 1045

Answers (1)

desertnaut

Reputation: 60400

From your comments:

I have tried to load the initial data directly into pandas df but it has consistently thrown me an error, saying the file doesn't exist, which is why I have had to convert it after loading it into Spark.

So, my answer has nothing to do with Spark, only with uploading data to Databricks Cloud (Community Edition), which seems to be your real issue here.

After initializing a cluster and uploading a file user_info.csv, we get this screenshot:

[screenshot: the Databricks file-upload confirmation]

including the actual path for our uploaded file.

Now, in a Databricks notebook, if you try to use this exact path with pandas, you'll get a File does not exist error:

 import pandas as pd
 pandas_df = pd.read_csv("/FileStore/tables/1zpotrjo1499779563504/user_info.csv")
 [...]
 IOError: File /FileStore/tables/1zpotrjo1499779563504/user_info.csv does not exist

because, as the instructions clearly mention, in that case (i.e. files you want to load directly into pandas or R instead of Spark) you need to prepend the file path with /dbfs:

 pandas_df = pd.read_csv("/dbfs/FileStore/tables/1zpotrjo1499779563504/user_info.csv") # works OK
 pandas_df.head() # works OK
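
For contrast, the same file read through Spark needs no /dbfs prefix, because Spark resolves paths against DBFS directly while local-file libraries such as pandas go through the /dbfs FUSE mount. A quick sketch, assuming the spark session predefined in a Databricks notebook:

 # Spark resolves the path against DBFS directly:
 spark_df = spark.read.csv("/FileStore/tables/1zpotrjo1499779563504/user_info.csv", header=True)
 # pandas (and R) read through the local /dbfs mount:
 pandas_df = pd.read_csv("/dbfs/FileStore/tables/1zpotrjo1499779563504/user_info.csv")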

Upvotes: 1
