pyspark left join only with the first record

Question

I have 2 pysaprk dataframes.

I am looking for option to join df1 with df2. Left join only with the first row from df2.

df1:

ID  string
1   sfafsda
2   trwe
3   gfdgsd

df2

ID  address   state
1   Montreal  Quebec
1   Quebec    Quebec
2   Trichy    TN
2   Madurai   TN
3   Bangalore KN
3   Mysore    KN
3   Hosur     KN

Expected output from join:

ID  string   address   state
1   sfafsda  Montreal  Quebec
2   trwe     Trichy    TN
3   gfdgsd   Bangalore KN

As I am working on databricks, please let me know whether it's easier to implement pyspark left join only with the first row or sql join is possible to achieve the expected output. Thanks.

mck · Accepted Answer

Yes it's possible using pyspark, but you need to add an index column to df2. See the code below:

df2 = df2.withColumn('index', F.monotonically_increasing_id())

df1.join(df2, 'ID', 'left') \
   .select('*', F.first(F.array('address', 'state')).over(Window.partitionBy('ID').orderBy('index')).alias('array')) \
   .select('ID', 'string', F.col('array')[0].alias('address'), F.col('array')[1].alias('state')) \
   .groupBy('ID', 'string') \
   .agg(F.first('address'), F.first('state')) \
   .orderBy('ID')

pyspark left join only with the first record

Answers (1)

Related Questions