HagenW
HagenW

Reputation: 11

How to properly assign a pandas dataframe slice to values in another data frame

although I solved this problem already, I was wondering, whether there is a more direct way of achieving my assignment.

import pandas as pd
df1 = pd.DataFrame({'position': ['20', '8000', '8000'],
                   'SNP_ID': ['rs01', 'rs02', 'rs03'],
                   'SNP_ref': ['A', 'C', 'T'],
                   'SNP_alts': ['G', 'T','A,G,']})

df2 = pd.DataFrame({'position': ['400', '8000', '90000'],
                   'SNP_ID': ['', '', ''],
                   'SNP_ref': ['', '', ''],
                   'SNP_alts': ['', '',''],
                   'check_ref':['T','T','A'],
                   'check_alts':['T','G','A'],
                   'other_data': ['xx','yy','zz']})

c1 = ['SNP_ID','SNP_ref','SNP_alts']

for i in range(len(df2)):

    SNVs = df1[df1['position'] == df2['position'].loc[i]]

    if not SNVs.empty:
        df2.loc[df2.index[i],c1] = SNVs.loc[SNVs['SNP_ref'] == df2['check_ref'].loc[i],c1].iloc[0]

        print(df2)

So essentially based on some criteria (more than shown here), I want to assign the values of three columns for a given row (based on some criteria) to three columns in another df. I only got this to work using .tolist().

Is there any more straightforward way of achieving this?

*note: I'm aware that looping over rows in a df is not good practice, but with my knowledge I'm currently unable to come up with a better solution and I have to do more comparisons to decide which rows to copy. For now my dfs are fairly small, so time is not a big issue.

Thanks Hagen

*UPDATE: based on the answers I modified my code again with a more realistic dataset, and got it to work without .tolist()

import pandas as pd
df1 = pd.DataFrame({'position': ['20', '8000', '8000'],
                   'SNP_ID': ['rs01', 'rs02', 'rs03'],
                   'SNP_ref': ['A', 'C', 'T'],
                   'SNP_alts': ['G', 'T','A,G,']})

df2 = pd.DataFrame({'position': ['400', '8000', '90000'],
                   'SNP_ID': ['', '', ''],
                   'SNP_ref': ['', '', ''],
                   'SNP_alts': ['', '',''],
                   'check_ref':['T','T','A'],
                   'check_alts':['T','G','A'],
                   'other_data': ['xx','yy','zz']})

c1 = ['SNP_ID','SNP_ref','SNP_alts']

for i in range(len(df2)):

    SNVs = df1[df1['position'] == df2['position'].loc[i]]

    if not SNVs.empty:
        df2.loc[df2.index[i],c1] = SNVs.loc[SNVs['SNP_ref'] == df2['check_ref'].loc[i],c1].iloc[0]

print(df2)

*Update 2 with additional comparison no checking whether letter ('A', 'T', etc) are matched in *_alts, but SNP_alts can contain multiple sequences seperated by colons (e.g. A,T,G,AA,GG)

import pandas as pd
df1 = pd.DataFrame({'position': ['20', '8000', '8000'],
                   'SNP_ID': ['rs01', 'rs02', 'rs03'],
                   'SNP_ref': ['A', 'C', 'T'],
                   'SNP_alts': ['G', 'T','A,G,']})

df2 = pd.DataFrame({'position': ['400', '8000', '90000'],
                   'SNP_ID': ['', '', ''],
                   'SNP_ref': ['', '', ''],
                   'SNP_alts': ['', '',''],
                   'check_ref':['T','T','A'],
                   'check_alts':['T','G','A'],
                   'other_data': ['xx','yy','zz']})

c1 = ['SNP_ID','SNP_ref','SNP_alts']

for i in range(len(df2)):

    SNVs = df1[df1['position'] == df2['position'].loc[i]]

    if not SNVs.empty:
        bm1 = SNVs['SNP_ref'] == df2['check_ref'].loc[i]
        bm2 = SNVs['SNP_alts'].apply(lambda x: True if df2['check_alts'].loc[i] in x.split(',') else False)

        if len(SNVs.loc[bm1 & bm2,c1])>0:
            df2.loc[df2.index[i],c1] = SNVs.loc[bm1 & bm2,c1].iloc[0]

print(df2)

Upvotes: 1

Views: 137

Answers (1)

jezrael
jezrael

Reputation: 862791

Use DataFrame.update with rename columns for correct match:

c1 = ['SNP_ID','SNP_ref','SNP_alts']
c2 = ['name','ref','alts']
d = dict(zip(c2, c1))

#for align values by column position
df11 = df1.set_index(['position','SNP_ref'])
df22 = df2.set_index(['position','check_ref'])
    
df22.update(df11.rename(columns=d))
df22 = df22.reset_index().reindex(df2.columns, axis=1)
print (df22)

  position SNP_ID SNP_ref SNP_alts check_ref other_data
0      400                                 T         xx
1     8000   rs03                A         T         yy
2    90000                                 A         zz

Upvotes: 1

Related Questions