Reputation: 988
I have 2 pysaprk dataframes.
I am looking for option to join df1 with df2. Left join only with the first row from df2.
df1:
ID string
1 sfafsda
2 trwe
3 gfdgsd
df2
ID address state
1 Montreal Quebec
1 Quebec Quebec
2 Trichy TN
2 Madurai TN
3 Bangalore KN
3 Mysore KN
3 Hosur KN
Expected output from join:
ID string address state
1 sfafsda Montreal Quebec
2 trwe Trichy TN
3 gfdgsd Bangalore KN
As I am working on databricks, please let me know whether it's easier to implement pyspark left join only with the first row or sql join is possible to achieve the expected output. Thanks.
Upvotes: 2
Views: 2995
Reputation: 42352
Yes it's possible using pyspark, but you need to add an index column to df2
. See the code below:
df2 = df2.withColumn('index', F.monotonically_increasing_id())
df1.join(df2, 'ID', 'left') \
.select('*', F.first(F.array('address', 'state')).over(Window.partitionBy('ID').orderBy('index')).alias('array')) \
.select('ID', 'string', F.col('array')[0].alias('address'), F.col('array')[1].alias('state')) \
.groupBy('ID', 'string') \
.agg(F.first('address'), F.first('state')) \
.orderBy('ID')
Upvotes: 4