Reputation: 4660
With the pandas read_csv() function I read an iso-8859-1 file as follows:
df = pd.read_csv('path/file', sep='|', names=['A','B'], encoding='iso-8859-1')
Then I would like to use MLlib's Word2Vec. However, it only accepts RDDs as input, so I tried to transform the pandas DataFrame to a Spark DataFrame as follows:
from pyspark.sql import SQLContext
spDF = sqlContext.createDataFrame(df['A'])
spDF.show()
Anyhow, I got the following exception:
TypeError: Can not infer schema for type: <type 'unicode'>
I went to the PySpark documentation to see if there is something like an encoding parameter, but I did not find anything. Any idea how to transform a specific pandas DataFrame column to a PySpark RDD?
Update:
Following @zeros323's answer, I tried saving the column as a DataFrame, like this:
new_dataframe = df_3.loc[:,'A']
new_dataframe.head()
Then:
from pyspark.sql import SQLContext
spDF = sqlContext.createDataFrame(new_dataframe)
spDF.show()
And I got the same exception:
TypeError: Can not infer schema for type: <type 'unicode'>
Upvotes: 0
Views: 1165
Reputation: 4660
From @zeros323's answer I noted that what I had was actually not a pandas DataFrame. I consulted the pandas documentation and found that to_frame()
can convert that specific column to a pandas DataFrame. So I did the following:
new_dataframe = df['A'].to_frame()
new_dataframe.head()
from pyspark.sql import SQLContext
spDF = sqlContext.createDataFrame(new_dataframe)
spDF.show()
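As a quick pandas-only sanity check (no Spark needed, and using made-up stand-in data rather than the original file), plain indexing yields a Series while to_frame() yields a single-column DataFrame, which is the shape createDataFrame can infer a schema from:

```python
import pandas as pd

# Hypothetical stand-in for the file's column A (not the OP's actual data)
df = pd.DataFrame({'A': ['caf\xe9', 'ni\xf1o'], 'B': [1, 2]})

series = df['A']            # plain column indexing -> pandas.Series
frame = df['A'].to_frame()  # to_frame() -> single-column pandas.DataFrame

print(type(series).__name__)  # Series
print(type(frame).__name__)   # DataFrame
print(list(frame.columns))    # ['A'] - the column name is preserved
```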
Upvotes: 0
Reputation: 330363
When you use df['A'], the result is not a pandas.DataFrame but a pandas.Series, hence when you pass it to SqlContext.createDataFrame it is treated like any other Iterable, and PySpark doesn't support conversion of simple types to DataFrame.

If you want to keep the data as a pandas DataFrame, select the column with a list of labels, e.g. using the loc method:

df.loc[:,['A']]
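The Series-vs-DataFrame distinction can be checked without Spark at all (with a hypothetical DataFrame standing in for the OP's data): a scalar column label returns a Series, while a list of labels preserves a DataFrame:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({'A': ['a', 'b'], 'B': [1, 2]})

print(type(df.loc[:, 'A']).__name__)    # Series    - scalar label
print(type(df.loc[:, ['A']]).__name__)  # DataFrame - list of labels
print(type(df[['A']]).__name__)         # DataFrame - equivalent shorthand
```

Any of the DataFrame-returning forms would avoid the "Can not infer schema" error, since createDataFrame then receives rows rather than bare unicode values.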
Upvotes: 2