Lilly
Lilly

Reputation: 988

pyspark left join only with the first record

I have 2 pysaprk dataframes.

I am looking for option to join df1 with df2. Left join only with the first row from df2.

df1:

ID  string
1   sfafsda
2   trwe
3   gfdgsd

df2

ID  address   state
1   Montreal  Quebec
1   Quebec    Quebec
2   Trichy    TN
2   Madurai   TN
3   Bangalore KN
3   Mysore    KN
3   Hosur     KN

Expected output from join:

ID  string   address   state
1   sfafsda  Montreal  Quebec
2   trwe     Trichy    TN
3   gfdgsd   Bangalore KN

As I am working on databricks, please let me know whether it's easier to implement pyspark left join only with the first row or sql join is possible to achieve the expected output. Thanks.

Upvotes: 2

Views: 2995

Answers (1)

mck
mck

Reputation: 42352

Yes it's possible using pyspark, but you need to add an index column to df2. See the code below:

df2 = df2.withColumn('index', F.monotonically_increasing_id())

df1.join(df2, 'ID', 'left') \
   .select('*', F.first(F.array('address', 'state')).over(Window.partitionBy('ID').orderBy('index')).alias('array')) \
   .select('ID', 'string', F.col('array')[0].alias('address'), F.col('array')[1].alias('state')) \
   .groupBy('ID', 'string') \
   .agg(F.first('address'), F.first('state')) \
   .orderBy('ID')

Upvotes: 4

Related Questions