Reputation: 421
I have a problem merging two dataframes
I'm processing a list of 10 dataframe pairs, all created from the same sql database and csv files.
On some pairs merge(df1, df2)
is working correctly but df1.join(df2)
is not. For example, thesea are the subsets from one of the pairs
>>> a
mod:user studentid
2010453 3891583 2010453
2112086 890910 2112086
2222220 201611 2222220
2346979 7084 2346979
2414996 1817436 2414996
2420317 52821 2420317
2438767 884012 2438767
2451924 20815145 2451924
2515531 2115829 2515531
2536751 494565 2536751
2549050 315295 2549050
2549530 0 2549530
2551532 544968 2551532
2551542 213 2551542
2610206 1257038 2610206
2624429 939670 2624429
2630017 6 2630017
2633815 190564 2633815
2633857 1147211 2633857
2634405 1093092 2634405
2641370 2038012 2641370
2644284 658743 2644284
2649427 220230 2649427
2712372 9468 2712372
2714617 1231577 2714617
2718450 3907345 2718450
2732910 0 2732910
2739711 396876 2739711
8200703 166 8200703
9906492 920875 9906492
oscarl 505 oscarl
>>> b
assignment:5 studentid
2010453 70 2010453
2112086 82 2112086
2222220 76 2222220
2346979 NaN 2346979
2414996 88 2414996
2438767 50 2438767
2451924 100 2451924
2515531 50 2515531
2536751 100 2536751
2538371 94 2538371
2549050 100 2549050
2551532 100 2551532
2610206 50 2610206
2624429 100 2624429
2630017 NaN 2630017
2634405 100 2634405
2641370 100 2641370
2644284 100 2644284
2712372 100 2712372
2714617 69 2714617
2718450 100 2718450
2739711 100 2739711
9906492 100 9906492
>>> pd.merge(a, b, left_on="studentid", right_on="studentid", how="inner")
mod:user studentid assignment:5
0 3891583 2010453 70
1 890910 2112086 82
2 201611 2222220 76
3 7084 2346979 NaN
4 1817436 2414996 88
5 884012 2438767 50
6 20815145 2451924 100
7 2115829 2515531 50
8 494565 2536751 100
9 315295 2549050 100
10 544968 2551532 100
11 1257038 2610206 50
12 939670 2624429 100
13 6 2630017 NaN
14 1093092 2634405 100
15 2038012 2641370 100
16 658743 2644284 100
17 9468 2712372 100
18 1231577 2714617 69
19 3907345 2718450 100
20 396876 2739711 100
21 920875 9906492 100
>>> a.join(b, on="studentid", rsuffix="r", how="inner")
Empty DataFrame
Columns: [mod:user, studentid, assignment:5, studentidr]
Index: []
>>>
Now, to make things really strange, on some other pair of dataframes merge(df1, df2)
is not woking but df1.join(df2)
is working.
>>> a
mod:user studentid
2115728 1177712 2115728
2341322 142805 2341322
2447383 1642046 2447383
2510156 141 2510156
2512053 570889 2512053
2527456 12262284 2527456
2529917 11826381 2529917
2533588 183665 2533588
2535922 107131 2535922
2535991 542259 2535991
2543095 11614678 2543095
2548984 225 2548984
2549565 2059072 2549565
2632847 25408938 2632847
2634371 129605 2634371
2714666 755975 2714666
8307654 74576 8307654
>>> b
assignment:5 studentid
2115728 86.67 2115728
2341322 86.67 2341322
2447383 80.00 2447383
2512053 93.33 2512053
2527456 93.33 2527456
2529917 86.67 2529917
2533588 86.67 2533588
2535922 86.67 2535922
2535991 86.67 2535991
2543095 100.00 2543095
2548984 100.00 2548984
2549565 86.67 2549565
2632847 100.00 2632847
2634371 73.33 2634371
2714666 80.00 2714666
8307654 86.67 8307654
>>> pd.merge(a, b, left_on="studentid", right_on="studentid", how="inner")
Empty DataFrame
Columns: [mod:user, studentid, assignment:5]
Index: []
>>> a.join(b, on="studentid", rsuffix="r", how="inner")
mod:user studentid assignment:5 studentidr
2115728 1177712 2115728 86.67 2115728
2341322 142805 2341322 86.67 2341322
2447383 1642046 2447383 80.00 2447383
2512053 570889 2512053 93.33 2512053
2527456 12262284 2527456 93.33 2527456
2529917 11826381 2529917 86.67 2529917
2533588 183665 2533588 86.67 2533588
2535922 107131 2535922 86.67 2535922
2535991 542259 2535991 86.67 2535991
2543095 11614678 2543095 100.00 2543095
2548984 225 2548984 100.00 2548984
2549565 2059072 2549565 86.67 2549565
2632847 25408938 2632847 100.00 2632847
2634371 129605 2634371 73.33 2634371
2714666 755975 2714666 80.00 2714666
8307654 74576 8307654 86.67 8307654
>>>
I really don't have a clue what is happening and which function to use
Upvotes: 3
Views: 8893
Reputation: 421
Thanks! I found the issue..it is the automatic data conversion when reading into data frame.
As you can see, in the first example someone put string 'oscarl' as the studentid and this resulted in whole columnt being threated as string while in the second example there are no records like this, so it is converted to int.
I found this by running the suggested solution which gave me the errror which pointed me into the right direction
pd.concat([a, b], axis=1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
...
...
Exception: ('Cannot have duplicate column names split across dtypes', 'occurred at index assignment:5')
Upvotes: 3
Reputation: 375925
Interestingly I can't reproduce your merge's empty DataFrame on pandas 0.12, seems to work fine.
I would recommend doing a concat here (which works better if they are indexed, but doing have the repeated studentid column). The benefit being that this generalises to multiple DataFrames/Series.
In [11]: df = pd.concat([a, b], axis=1)
In [12]: del df['studentid'] # cleaner if you don't have these as cols in the first place
In [13]: df
Out[13]:
mod:user assignment:5
2115728 1177712 86.67
2341322 142805 86.67
2447383 1642046 80.00
2510156 141 NaN
2512053 570889 93.33
2527456 12262284 93.33
2529917 11826381 86.67
2533588 183665 86.67
2535922 107131 86.67
2535991 542259 86.67
2543095 11614678 100.00
2548984 225 100.00
2549565 2059072 86.67
2632847 25408938 100.00
2634371 129605 73.33
2714666 755975 80.00
8307654 74576 86.67
You can use the join='inner'
argument to not include the NaNs.
Upvotes: 1