Reputation: 419
I have two df.
df1
has over 2 milion rows and has complete data. I'd like to join data from df2
, which has over 70.000 rows but it's structure is a bit complicated.
df1 has for eac row keys KO-STA
and KO-PAR
.
df2 has in some cases data only on KO-STA
, in some cases only on KO-PAR
and in some case on both.
I'd like to merge those two df and get the data on Need1
and Need2
.
df1
STA_SID DST_SID CC KO_SIFKO KO-STA KO-PAR
135 10021582 28878502 NaN 634 634-83 537-780/9
117 10028732 29999540 NaN 657 657-1729 537-780/4
117 10028732 29999541 NaN 657 657-1729 537-780/4
117 10028732 29999542 NaN 657 657-1729 537-780/4
117 10028732 29999543 NaN 657 657-1729 537-780/4
117 10028732 31356572 NaN 657 657-1729 537-780/4
df2
KO-STA STA-PAR KO-PAR Need1 Need2 \
0 1976-_ 366/2 1976-366/2 Bio 49.500000
1 991-_ 329/128 991-329/128 PH 184.399994
2 2147--- 96/19 2147-96/19 Win 8.850000
3 2048-_ 625/4 2048-625/4 SSE 4.940000
4 2194-_ 285/3 2194-285/3 TI f 163.000000
5 2386--- 97/1 2386-97/1 Bio 49.500000
6 2002-_ 2002/9 2002-2002/9 Win 12.850000
7 1324-_ 62 1324-62 Win 8.850000
8 1625-_ 980/1 1625-980/1 Win 8.850000
9 1625-_ 980/1 1625-980/1 Bio 49.500000
My attempt was with the following code
GURS_ES1 = pd.merge(df1.reset_index(), df2.reset_index(), on = 'KO-STA')
GURS_ES2 = pd.merge(GURS_ES1.reset_index(), df2.reset_index(), on = 'KO-PAR')
But after the first merge, GURS_ES1 has two indexes KO-PAR_x
and KO-PAR_y
and it doesn't join them as one column. Any recommendations?
Upvotes: 0
Views: 133
Reputation: 3224
I provide you an example to make sure how you can proceed an what is the reason for the behaviour you observed:
First, let's construct our sample data
df1 = pd.DataFrame(np.random.randint(1,3,size=(3,3)),columns=['a1','x1','x2'])
Output
a1 x1 x2
0 1 2 1
1 2 1 1
2 1 2 2
Now, the other dataframe
df2 = pd.DataFrame(np.random.randint(1,3,size=(3,3)),columns=['a2','x1','x2'])
a2 x1 x2
0 2 2 1
1 1 2 2
2 1 1 2
Now, if we merge on only(!) one of the indices which occur in both dataframes, then pandas wants you to be able to reconstruct from which dataframe the index originally came
pd.merge(df1,df2, on='x1')
Output
a1 x1 x2_x a2 x2_y
0 1 2 1 2 1
1 1 2 1 1 2
2 1 2 2 2 1
3 1 2 2 1 2
4 2 1 1 1 2
Now, the easiest way to get rid of this is to drop one of the double occuring columns in one of the dataframes:
pd.merge(df1[df1.columns.drop('x2')], df2, on='x1')
Output
a1 x1 a2 x2
0 1 2 2 1
1 1 2 1 2
2 1 2 2 1
3 1 2 1 2
4 2 1 1 2
But you could also merge on a list of columns. Note that we perform an inner join here, which can reduce the number of rows in the output dataframa significantly (or even lead to empty dataframes if there are no matches on both columns)
pd.merge(df1,df2, on=['x1','x2'])
a1 x1 x2 a2
0 1 2 1 2
1 1 2 2 1
Upvotes: 1