tumbleweed

Reputation: 4660

Problems while transforming pandas dataframe to PySpark RDD?

With pandas read_csv() function I read an iso-8859-1 file as follows:

df = pd.read_csv('path/file', sep='|', names=['A', 'B'], encoding='iso-8859-1')

Then I would like to use MLlib's Word2Vec. However, it only accepts RDDs as input, so I tried to convert the pandas dataframe column to a Spark DataFrame as follows:

from pyspark.sql import SQLContext
spDF = sqlContext.createDataFrame(df['A'])
spDF.show()

Anyhow, I got the following exception:

TypeError: Can not infer schema for type: <type 'unicode'>

I went through PySpark's documentation to see if there is something like an encoding parameter, but I did not find anything. Any idea how to transform a specific pandas dataframe column into a PySpark RDD?
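For what it's worth, MLlib's Word2Vec ultimately wants an RDD of token lists, so the Spark DataFrame detour can be skipped by parallelizing the column directly. A minimal sketch of the pandas side (the toy data, whitespace tokenization, and the live SparkContext `sc` are assumptions, not from the question):

```python
import pandas as pd

# toy stand-in for the ISO-8859-1 file read in the question
df = pd.DataFrame({'A': ['hello world', 'foo bar baz']})

# Word2Vec expects an RDD of lists of tokens, so tokenize each line first
token_lists = [line.split() for line in df['A']]

# with a live SparkContext `sc` (assumption), the remaining steps would be:
# from pyspark.mllib.feature import Word2Vec
# rdd = sc.parallelize(token_lists)
# model = Word2Vec().fit(rdd)
```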

update:

Following @zero323's answer, I tried saving the column as a dataframe, like this:

new_dataframe = df_3.loc[:,'A']
new_dataframe.head()

Then:

from pyspark.sql import SQLContext
spDF = sqlContext.createDataFrame(new_dataframe)
spDF.show()

And I got the same exception:

TypeError: Can not infer schema for type: <type 'unicode'>

Upvotes: 0

Views: 1165

Answers (2)

tumbleweed

Reputation: 4660

From @zero323's answer I noted that it actually was not a pandas dataframe. I consulted the pandas documentation and found that to_frame() can convert that specific column into a pandas dataframe. So I did the following:

new_dataframe = df['A'].to_frame()
new_dataframe.head()
from pyspark.sql import SQLContext
spDF = sqlContext.createDataFrame(new_dataframe)
spDF.show()
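As a self-contained illustration of why to_frame() helps (the toy data below is made up): selecting a single column yields a pandas.Series, and to_frame() wraps it back into a one-column DataFrame, which createDataFrame can infer a schema from.

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'y'], 'B': [1, 2]})

# selecting a single column yields a Series...
col = df['A']

# ...while to_frame() turns it into a one-column DataFrame
new_dataframe = df['A'].to_frame()
```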

Upvotes: 0

zero323

Reputation: 330363

df['A'] is not a pandas.DataFrame but a pandas.Series, hence when you pass it to SQLContext.createDataFrame it is treated like any other Iterable, and PySpark doesn't support conversion of simple types to a DataFrame.

If you want to keep the data as a pandas DataFrame, select with a list of labels (a scalar label like 'A' would still give you a Series):

df.loc[:, ['A']]

Upvotes: 2
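The Series/DataFrame distinction can be checked directly in pandas (toy data; note that a scalar column label returns a Series, while a list of labels returns a one-column DataFrame):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b'], 'B': [1, 2]})

series = df.loc[:, 'A']    # scalar label  -> pandas.Series
frame = df.loc[:, ['A']]   # list of labels -> one-column pandas.DataFrame
```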
