Ma Pengfei
Ma Pengfei

Reputation: 11

'RDD' object has no attribute 'sparkSession'

Here is my code, I tried to import everything but still report error.

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
import pyspark

sc = SparkContext(appName="session1")

a = [('Chris', 'Budweiser', 15), ('Chris', 'Becks', 5), ('Chris', 'Heineken', 2), ('Bob', 'Becks', 15), ('Bob', 'Budweiser', 10) , ('Bob', 'Heineken', 2) ,  ('Alice', 'Heineken', 8) ] 

rdd = sc.parallelize(a)

sparkSession=spark

df = SQLContext.createDataFrame(rdd, ['drinker', 'beer', 'score'])

SQLContext.registerDataFrameAsTable(df, "drinkers")

sc.stop()

The error is

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-cfe5dd94abe2> in <module>
     11 sparkSession=spark
     12 
---> 13 df = SQLContext.createDataFrame(rdd, ['drinker', 'beer', 'score'])
     14 
     15 SQLContext.registerDataFrameAsTable(df, "drinkers")

~/opt/anaconda3/lib/python3.7/site-packages/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    367         Py4JJavaError: ...
    368         """
--> 369         return self.sparkSession.createDataFrame(data, schema, samplingRatio, verifySchema)
    370 
    371     def registerDataFrameAsTable(self, df, tableName):

AttributeError: 'RDD' object has no attribute 'sparkSession'

I tried to import more packages but it does not work. And this looks like not a normal error. Is that probably because I need to pip install something?

Upvotes: 0

Views: 400

Answers (1)

pltc
pltc

Reputation: 6082

Your original code df = SQLContext.createDataFrame(rdd, ['drinker', 'beer', 'score']) is wrong (SQLContext is a class not an instance, and wrong API).

This is (one of) the correct way:

df = (spark
    .sparkContext
    .parallelize([
        ('Chris', 'Budweiser', 15),
        ('Chris', 'Becks', 5),
        ('Chris', 'Heineken', 2),
        ('Bob', 'Becks', 15),
        ('Bob', 'Budweiser', 10),
        ('Bob', 'Heineken', 2),
        ('Alice', 'Heineken', 8)
    ])
    .toDF(['drinker', 'beer', 'score'])
)

df.show()

# Output
# +-------+---------+-----+
# |drinker|     beer|score|
# +-------+---------+-----+
# |  Chris|Budweiser|   15|
# |  Chris|    Becks|    5|
# |  Chris| Heineken|    2|
# |    Bob|    Becks|   15|
# |    Bob|Budweiser|   10|
# |    Bob| Heineken|    2|
# |  Alice| Heineken|    8|
# +-------+---------+-----+

Upvotes: 0

Related Questions