lengthy_preamble

Reputation: 434

The "table" in sqlContext.table

I'm working my way through a book on Spark and I'm on a section dealing with the join method for dataframes. In this example, the "trips" table is being joined with the "stations" table:

trips = sqlContext.table("trips")
stations = sqlContext.table("stations")
joined = trips.join(stations, trips.start_terminal == stations.station_id)
joined.printSchema()

The data is supposed to come from two CSV files, trips.csv and stations.csv, but I don't see how Spark is supposed to figure that out. It seems to me that there should be a line indicating where "trips" and "stations" are supposed to come from.

If I try something like

trips = sqlContext.table('/home/l_preamble/Documents/trips.csv')

it fails with:

pyspark.sql.utils.ParseException: u"\nextraneous input '/' expecting {'SELECT', 'FROM', 'ADD'..."

So how can I point it in the direction of the data? Any help would be appreciated.

Upvotes: 1

Views: 4161

Answers (2)

hitttt

Reputation: 1189

In order to join two DataFrames in PySpark, you should try this:

df1 = sqlContext.table("trips")
df2 = sqlContext.table("stations")

# the join type must be passed as a string, e.g. 'outer'
df2.join(df1, ['column_name'], 'outer')

Note that registerTempTable is a method on a DataFrame (and returns None), so it can't be used to load the tables; sqlContext.table looks up a table that has already been registered.

Upvotes: 1

Zhang Tong

Reputation: 4719

I think you may need something like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('MyApp').getOrCreate()
df_trips = spark.read.load(path='/home/l_preamble/Documents/trips.csv', format='csv', sep=',')
df_trips.createOrReplaceTempView('trips')
result = spark.sql("""select * from trips""")

Upvotes: 1
