Reputation: 434
I'm working my way through a book on Spark and I'm on a section dealing with the join method for dataframes. In this example, the "trips" table is being joined with the "stations" table:
trips = sqlContext.table("trips")
stations = sqlContext.table("stations")
joined = trips.join(stations, trips.start_terminal == stations.station_id)
joined.printSchema()
The data is supposed to come from two CSV files, trips.csv and stations.csv, but I don't know how Spark is supposed to figure that out. It seems to me that there should be a line indicating where "trips" and "stations" are supposed to come from.
If I try something like
trips = sqlContext.table('/home/l_preamble/Documents/trips.csv')
it doesn't like it:
pyspark.sql.utils.ParseException: u"\nextraneous input '/' expecting {'SELECT', 'FROM', 'ADD'...
So how can I point it in the direction of the data? Any help would be appreciated.
Upvotes: 1
Views: 4161
Reputation: 1189
To join two DataFrames in PySpark, try this:
df1 = sqlContext.table("trips")
df2 = sqlContext.table("stations")
joined = df2.join(df1, ['column_name'], 'outer')
Note that registerTempTable is a method on a DataFrame, not on the SQLContext, and it returns None, so assigning its result would not give you a DataFrame. The join type must also be passed as a string, e.g. 'outer'.
Upvotes: 1
Reputation: 4719
I think you may need this:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('MyApp').getOrCreate()
# Read the CSV into a DataFrame (add header=True if the file has a header row)
df_trips = spark.read.load(path='/home/l_preamble/Documents/trips.csv', format='csv', sep=',')
df_trips.createOrReplaceTempView('trips')
result = spark.sql("""select * from trips""")
Upvotes: 1