Reputation: 1185
I have a problem with pandas and replacing values. I have a table with animals and their alleles looks that:
Name User A1_Top A2_Top
stefan1 721 A C
stefan2 721 A G
stefan3 331 T T
stefan4 331 C G
stefan5 331 A A
stefan6 721 G G
And I need to change values of Top1 and Top2 by a specific Key for each row.
For example: if values in same rows will be = C & A, I will replace it to A & B, if row == TT will be BB
etc. (key is in if/else below).
I got an answer in another post how to do it by dictionary but I can't handle with that double condition (if it will be one condition for ex. if A in first row, replace to B, it will be ok). So I just put it into if/else loop and it works... I mean worked, it worked until files were large. Now it soo slow. 300mb file can be processing 30 min on a standard desktop.
That's how my code look now:
def ATCG_to_AB(df):
x = 0
for i in range(lenFor):
if df['A1_TOP'].iloc[i] == 'A' and df['A2_TOP'].iloc[i] =='C':
df['A1_TOP'].iloc[i] = 'A'
df['A2_TOP'].iloc[i] ='B'
elif df['A1_TOP'].iloc[i] == 'A' and df['A2_TOP'].iloc[i] =='G':
df['A1_TOP'].iloc[i] = 'A'
df['A2_TOP'].iloc[i] ='B'
elif df['A1_TOP'].iloc[i] == 'A' and df['A2_TOP'].iloc[i] =='T':
df['A1_TOP'].iloc[i] = 'A'
df['A2_TOP'].iloc[i] ='B'
elif df['A1_TOP'].iloc[i] == 'C' and df['A2_TOP'].iloc[i] =='G':
df['A1_TOP'].iloc[i] = 'A'
df['A2_TOP'].iloc[i] ='B'
elif df['A1_TOP'].iloc[i] == 'A' and df['A2_TOP'].iloc[i] =='A':
df['A1_TOP'].iloc[i] = 'A'
df['A2_TOP'].iloc[i] ='A'
elif df['A1_TOP'].iloc[i] == 'C' and df['A2_TOP'].iloc[i] =='C':
df['A1_TOP'].iloc[i] = 'B'
df['A2_TOP'].iloc[i] ='B'
elif df['A1_TOP'].iloc[i] == 'G' and df['A2_TOP'].iloc[i] =='G':
df['A1_TOP'].iloc[i] = 'B'
df['A2_TOP'].iloc[i] ='B'
elif df['A1_TOP'].iloc[i] == 'T' and df['A2_TOP'].iloc[i] =='T':
df['A1_TOP'].iloc[i] = 'B'
df['A2_TOP'].iloc[i] ='B'
else:
print(x,". Something is wrong in line: ", i)
x+=1
X - counting errors. And I know, that code is ugly, so I tried to put the dictionary method. My attempt:
L = [('A', 'C', 'A', 'B'),('A', 'G', 'A', 'B'),('A', 'T', 'A', 'B'),
('C', 'G', 'A', 'B'),('A', 'A', 'A', 'A'),('C', 'C', 'B', 'B'),
('G', 'G', 'B', 'B'),('T', 'T', 'B', 'B')]
for x in L:
a.loc[(df[2] == x[0]) & (df[3] == x[1]), [2,3]] = [x[2], x[3]]
But I got a bad output. The only A1_top is changed, and usually it a bad symbol. Can someone help me translate my ugly code to the dictionary and explain it? And do I think properly that will be a faster solution?
For sure, expecting output (there are no headers in output, below for clarity)
name User A1_Top A2_Top
stefan1 721 A B
stefan2 721 A B
stefan3 331 B B
stefan4 331 A B
stefan5 331 A A
stefan6 721 B B
Upvotes: 0
Views: 65
Reputation: 587
A simple trick, not to say the best, but it works:
[Create a dummy column to do mapping or df[col].apply
]
df['combined'] = df['A1_Top']+"|"+df['A2_Top']
A1_Top A2_Top combined
0 A C A|C
1 A G A|G
2 T T T|T
3 C G C|G
4 A A A|A
5 G G G|G
Create dictionary, mapping all your requirements: I give 1 here
map_dict = {}
map_dict['A|C'] = 'B|C'
.
.
.
df['new_values'] = df['combined'].apply(lambda x:map_dict[x] if x in map_dict.keys() else x)
A1_Top A2_Top combined new_values
0 A C A|C B|C
1 A G A|G A|G
2 T T T|T T|T
3 C G C|G C|G
4 A A A|A A|A
5 G G G|G G|G
df['new_a1_top'] = df['new_values'].apply(lambda x: x.split('|')[0])
df['new_a2_top'] = df['new_values'].apply(lambda x: x.split('|')[1])
A1_Top A2_Top combined new_values new_a1_top new_a2_top
0 A C A|C B|C B C
1 A G A|G A|G A G
2 T T T|T T|T T T
3 C G C|G C|G C G
4 A A A|A A|A A A
5 G G G|G G|G G G
Upvotes: 2