HilaD
HilaD

Reputation: 881

pyspark merge/join two data frame

I want to merge two data frame.

list 1:

      id        
0   1234
1   456
2   789
3   101112

list 2:

      id    age        
0   456      18
1   101112   52

Desired result:

      id    age     
0   1234     Na
1   456      18
2   789      Na
3   101112   52

tried:

list1.join(list2,list1.id==list2.id,'inner')

error:

ValueError: Can only compare identically-labeled Series objects

tried:

list1.merge(list2)

error:

result is list 2.

what is the problem?

Upvotes: 0

Views: 5920

Answers (2)

Ramesh Maharjan
Ramesh Maharjan

Reputation: 41957

If you have two dataframes as

list1 = sqlContext.createDataFrame([(1234,), (456,), (789,), (101112,)], ["id"])
list1.show()
+------+
|    id|
+------+
|  1234|
|   456|
|   789|
|101112|
+------+

and

list2 = sqlContext.createDataFrame([(456,18), (101112, 52)], ["id", "age"])
list2.show()
+------+---+
|    id|age|
+------+---+
|   456| 18|
|101112| 52|
+------+---+

A left join should give you your desired result

list1.join(list2, "id", "left").show()

+------+----+
|    id| age|
+------+----+
|   789|null|
|  1234|null|
|101112|  52|
|   456|  18|
+------+----+

Upvotes: 2

Silvio
Silvio

Reputation: 4207

You need to perform a left_outer join so it doesn't drop records

list3 = list1.join(list2, 'id', 'left_outer')
list3.show()

+------+----+
|    id| age|
+------+----+
|   789|null|
|  1234|null|
|101112|  52|
|   456|  18|
+------+----+

Upvotes: 3

Related Questions