Joining two df with condition in python

Question

I have two df. df1 has over 2 milion rows and has complete data. I'd like to join data from df2, which has over 70.000 rows but it's structure is a bit complicated. df1 has for eac row keys KO-STA and KO-PAR. df2 has in some cases data only on KO-STA, in some cases only on KO-PAR and in some case on both. I'd like to merge those two df and get the data on Need1 and Need2.

df1

          STA_SID   DST_SID        CC   KO_SIFKO    KO-STA       KO-PAR
135  10021582  28878502            NaN       634    634-83        537-780/9   
117  10028732  29999540            NaN       657  657-1729        537-780/4
117  10028732  29999541            NaN       657  657-1729        537-780/4
117  10028732  29999542            NaN       657  657-1729        537-780/4
117  10028732  29999543            NaN       657  657-1729        537-780/4
117  10028732  31356572            NaN       657  657-1729        537-780/4

df2

        KO-STA  STA-PAR       KO-PAR          Need1       Need2  \
0   1976-_    366/2   1976-366/2                Bio       49.500000   
1    991-_  329/128  991-329/128                 PH      184.399994   
2  2147---    96/19   2147-96/19                Win        8.850000   
3   2048-_    625/4   2048-625/4                SSE        4.940000   
4   2194-_    285/3   2194-285/3               TI f      163.000000   
5  2386---     97/1    2386-97/1                Bio       49.500000   
6   2002-_   2002/9  2002-2002/9                Win       12.850000   
7   1324-_       62      1324-62                Win        8.850000   
8   1625-_    980/1   1625-980/1                Win        8.850000   
9   1625-_    980/1   1625-980/1                Bio       49.500000

My attempt was with the following code

GURS_ES1 = pd.merge(df1.reset_index(), df2.reset_index(), on = 'KO-STA')
GURS_ES2 = pd.merge(GURS_ES1.reset_index(), df2.reset_index(), on = 'KO-PAR')

But after the first merge, GURS_ES1 has two indexes KO-PAR_x and KO-PAR_y and it doesn't join them as one column. Any recommendations?

pythonic833 · Accepted Answer

I provide you an example to make sure how you can proceed an what is the reason for the behaviour you observed:

First, let's construct our sample data

df1 = pd.DataFrame(np.random.randint(1,3,size=(3,3)),columns=['a1','x1','x2'])

Output

   a1   x1  x2
0   1   2   1
1   2   1   1
2   1   2   2

Now, the other dataframe

df2 = pd.DataFrame(np.random.randint(1,3,size=(3,3)),columns=['a2','x1','x2'])

    a2  x1  x2
0   2   2   1
1   1   2   2
2   1   1   2

Now, if we merge on only(!) one of the indices which occur in both dataframes, then pandas wants you to be able to reconstruct from which dataframe the index originally came

pd.merge(df1,df2, on='x1')

Output

   a1  x1 x2_x a2 x2_y
0   1   2   1   2   1
1   1   2   1   1   2
2   1   2   2   2   1
3   1   2   2   1   2
4   2   1   1   1   2

Now, the easiest way to get rid of this is to drop one of the double occuring columns in one of the dataframes:

pd.merge(df1[df1.columns.drop('x2')], df2, on='x1')

Output

    a1  x1  a2  x2
0   1   2   2   1
1   1   2   1   2
2   1   2   2   1
3   1   2   1   2
4   2   1   1   2

But you could also merge on a list of columns. Note that we perform an inner join here, which can reduce the number of rows in the output dataframa significantly (or even lead to empty dataframes if there are no matches on both columns)

pd.merge(df1,df2, on=['x1','x2'])

    a1  x1  x2  a2
0   1   2   1   2
1   1   2   2   1

Joining two df with condition in python

Answers (1)

Related Questions