Reputation: 434
I'm working my way through a book on Spark and I'm on a section dealing with the join method for dataframes. In this example, the "trips" table is being joined with the "stations" table:
trips = sqlContext.table("trips")
stations = sqlContext.table("stations")
joined = trips.join(stations, trips.start_terminal == stations.station_id)
joined.printSchema()
The data is supposed to come from two CSV files, trips.csv and stations.csv, but I don't know how Spark is supposed to figure that out. It seems to me that there should be a line indicating where "trips" and "stations" are supposed to come from.
If I try something like
trips = sqlContext.table('/home/l_preamble/Documents/trips.csv')
it doesn't like it:
pyspark.sql.utils.ParseException: u"\nextraneous input '/' expecting {'SELECT', 'FROM', 'ADD'...
So how can I point it in the direction of the data? Any help would be appreciated.
Upvotes: 1
Views: 4161
Reputation: 1189
To join two DataFrames in PySpark, try this:
df1 = sqlContext.table("trips")
df2 = sqlContext.table("stations")
joined = df2.join(df1, ['column_name'], 'outer')
Note that registerTempTable is a method on a DataFrame, not on the SQLContext, and it returns None, so assigning its result would not give you a DataFrame. The join type must also be passed as a string, e.g. 'outer'.
Upvotes: 1
Reputation: 4719
I think you may need this:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('MyApp').getOrCreate()
# Read the CSV into a DataFrame (add header=True if the file has a header row)
df_trips = spark.read.load(path='/home/l_preamble/Documents/trips.csv', format='csv', sep=',')
df_trips.createOrReplaceTempView('trips')
result = spark.sql("""select * from trips""")
Upvotes: 1