tumbleweed

Reputation: 4660

Problems while transforming pandas dataframe to PySpark RDD?

With pandas read_csv() function I read an iso-8859-1 file as follows:

df = pd.read_csv('path/file', sep='|', names=['A', 'B'], encoding='iso-8859-1')

Then I would like to use MLlib's Word2Vec. However, it only accepts RDDs as input, so I tried to convert the pandas dataframe column to a Spark DataFrame as follows:

from pyspark.sql import SQLContext
spDF = sqlContext.createDataFrame(df['A'])
spDF.show()

Anyhow, I got the following exception:

TypeError: Can not infer schema for type: <type 'unicode'>

I went through PySpark's documentation to see if there is something like an encoding parameter, but I did not find anything. Any idea how to transform a specific pandas dataframe column into a PySpark RDD?
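For what it's worth, MLlib's Word2Vec ultimately wants an RDD of token lists, so the Spark DataFrame detour can be skipped by parallelizing the column directly. A minimal sketch of the pandas side (the toy data, whitespace tokenization, and the live SparkContext `sc` are assumptions, not from the question):

```python
import pandas as pd

# toy stand-in for the ISO-8859-1 file read in the question
df = pd.DataFrame({'A': ['hello world', 'foo bar baz']})

# Word2Vec expects an RDD of lists of tokens, so tokenize each line first
token_lists = [line.split() for line in df['A']]

# with a live SparkContext `sc` (assumption), the remaining steps would be:
# from pyspark.mllib.feature import Word2Vec
# rdd = sc.parallelize(token_lists)
# model = Word2Vec().fit(rdd)
```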

update:

Following @zero323's answer, I tried saving the column as a dataframe, like this:

new_dataframe = df_3.loc[:,'A']
new_dataframe.head()

Then:

from pyspark.sql import SQLContext
spDF = sqlContext.createDataFrame(new_dataframe)
spDF.show()

And I got the same exception:

TypeError: Can not infer schema for type: <type 'unicode'>

Upvotes: 0

Views: 1165

Answers (2)

tumbleweed

Reputation: 4660

From @zero323's answer I noted that it actually was not a pandas dataframe. I consulted the pandas documentation and found that to_frame() can convert that specific column into a pandas dataframe. So I did the following:

new_dataframe = df['A'].to_frame()
new_dataframe.head()
from pyspark.sql import SQLContext
spDF = sqlContext.createDataFrame(new_dataframe)
spDF.show()
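As a self-contained illustration of why to_frame() helps (the toy data below is made up): selecting a single column yields a pandas.Series, and to_frame() wraps it back into a one-column DataFrame, which createDataFrame can infer a schema from.

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'y'], 'B': [1, 2]})

# selecting a single column yields a Series...
col = df['A']

# ...while to_frame() turns it into a one-column DataFrame
new_dataframe = df['A'].to_frame()
```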

Upvotes: 0

zero323

Reputation: 330363

df['A'] is not a pandas.DataFrame but a pandas.Series, hence when you pass it to SQLContext.createDataFrame it is treated like any other Iterable, and PySpark doesn't support conversion of simple types to a DataFrame.

If you want to keep the data as a pandas DataFrame, select with a list of labels (a scalar label like 'A' would still give you a Series):

df.loc[:, ['A']]

Upvotes: 2
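The Series/DataFrame distinction can be checked directly in pandas (toy data; note that a scalar column label returns a Series, while a list of labels returns a one-column DataFrame):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b'], 'B': [1, 2]})

series = df.loc[:, 'A']    # scalar label  -> pandas.Series
frame = df.loc[:, ['A']]   # list of labels -> one-column pandas.DataFrame
```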
