jovicbg
jovicbg

Reputation: 1553

Join two dataframes in pyspark by one column

I have a two dataframes that I need to join by one column and take just rows from the first dataframe if that id is contained in the same column of second dataframe:

df1:

  id    a     b
  2     1     1
  3    0.5    1
  4     1     2
  5     2     1

df2:

 id      c    d
  2      fs   a
  5      fa   f

Desired output:

df:
  id   a   b
   2   1   1
   5   2   1

I have tried with df1.join(df2("id"),"left"), but gives me error :'Dataframe' object is not callable.

Upvotes: 1

Views: 11491

Answers (2)

Shantanu Sharma
Shantanu Sharma

Reputation: 4099

If you need to check if id exists in df2 and does not need any column in your output from df2 then isin() is more efficient solution (This is similar to EXISTS and IN in SQL).

df1 = spark.createDataFrame([(2,1,1) ,(3,5,1,),(4,1,2),(5,2,1)], "id: Int, a : Int , b : Int")

df2 = spark.createDataFrame([(2,'fs','a') ,(5,'fa','f')], ['id','c','d'])

Create df2.id as list and pass it to df1 under isin()

from pyspark.sql.functions import col

df2_list = df2.select('id').rdd.map(lambda row : row[0]).collect()

df1.where(col('id').isin(df2_list)).show()

#+---+---+---+
#| id|  a|  b|
#+---+---+---+
#|  2|  1|  1|
#|  5|  2|  1|
#+---+---+---+

It is reccomended to use isin() IF -

  • You don't need to return data from the refrence dataframe/table

  • You have duplicates in the refrence dataframe/table (JOIN can cause duplicate rows if values are repeated)

  • You just want to check existence of particular value

Upvotes: 2

akuiper
akuiper

Reputation: 215117

df2("id") is not a valid python syntax for selecting columns, you'd either need df2[["id"]] or use select df2.select("id"); For your example, you can do:

df1.join(df2.select("id"), "id").show()

+---+---+---+
| id|  a|  b|
+---+---+---+
|  5|2.0|  1|
|  2|1.0|  1|
+---+---+---+

or:

df1.join(df2[["id"]], "id").show()
+---+---+---+
| id|  a|  b|
+---+---+---+
|  5|2.0|  1|
|  2|1.0|  1|
+---+---+---+

Upvotes: 5

Related Questions