Reputation: 343
I have a dataframe:
df = pd.DataFrame({'REF':list('GCTT'), 'ALT':list('AACG'),
'A1':['0/1','0/1','0/0','0/1'],
'A2':['1/1','0/1','0/1','0/0']})
REF ALT A1 A2
0 G A 0/1 1/1
1 C A 0/1 0/1
2 T C 0/0 0/1
3 T G 0/1 0/0
I want to convert the columns A1 and A2 based on values in the REF and ALT columns. So, columns A1 & A2 in Row 0 should read GA & AA. ie lose the "/" and replace 0s with G and 1s with A. Next, row 1 should replace 0s with C and 1s with A. Then follow the pattern for the next rows, to get:
REF ALT A1 A2
0 G A GA AA
1 C A CA CA
2 T C TT TC
3 T G TG TT
In my data there are hundreds of A columns: A1, A2......An-1, An. So, the solution needs to be replicable across all columns.
Upvotes: 2
Views: 57
Reputation: 2407
I wonder how fast this solution is with your data:
for col in ["A1","A2"]:
df[col]= df[col].str.split("/",expand=True) \
.replace(["0","1"],[df.REF,df.ALT]) \
.agg("".join,axis=1)
df
REF ALT A1 A2
0 G A GA AA
1 C A CA CA
2 T C TT TC
3 T G TG TT
EDIT: Solution 2., Working with indexes:
# helper structs:
ncbscols= ["REF","ALT"]
cols= df.columns.difference(ncbscols)
ii= pd.MultiIndex.from_product([list("ACGT"),list("ACGT"),["0/0","0/1","1/1","1/0"] ])
ser= pd.Series( [t[2].replace("/","").replace("0",t[0]).replace("1",t[1]) for t in ii ], index=ii )
# the main calculation:
for c in cols:
mi= pd.MultiIndex.from_arrays([ df.REF.values,df.ALT.values,df[c].values ])
df[c]= ser[mi].values
ser:
A A 0/0 AA
0/1 AA
1/1 AA
1/0 AA
C 0/0 AA
..
T G 1/0 GT
T 0/0 TT
0/1 TT
1/1 TT
1/0 TT
Length: 64, dtype: object
df:
REF ALT A1 A2
0 G A GA AA
1 C A CA CA
2 T C TT TC
3 T G TG TT
Upvotes: 2
Reputation: 25239
You have only 4 combo of 0
and 1
, so I think you may try np.select
df1 = df.drop(['REF', 'ALT'], axis=1)
#conditions
combo = ['0/0', '0/1', '1/0', '1/1']
conds = [df1.eq(x) for x in combo]
#selections
s00 = (df.REF * 2).to_numpy()[:,None]
s11 = (df.ALT * 2).to_numpy()[:,None]
s01 = (df.REF + df.ALT).to_numpy()[:,None]
df.loc[:, df1.columns.tolist()] = np.select(conds , [s00, s01, s01[:,::-1], s11], np.nan)
Out[260]:
REF ALT A1 A2
0 G A GA AA
1 C A CA CA
2 T C TT TC
3 T G TG TT
Upvotes: 0
Reputation: 14094
I think there might be a better way to do this, but this works. let me know if it's a performance issue.
acols = df.drop(['REF', 'ALT'], axis=1).columns
for i in acols:
df.loc[df[i] == '0/0', i] = df['REF'] * 2
df.loc[df[i] == '0/1', i] = df['REF'] + df['ALT']
df.loc[df[i] == '1/1', i] = df['ALT'] * 2
Another option
for i in acols:
df[i] = df[i].replace(to_replace='0/0', value=df['REF']+df['REF'])
df[i] = df[i].replace(to_replace='0/1', value=df['REF']+df['ALT'])
df[i] = df[i].replace(to_replace='1/1', value=df['ALT']+df['ALT'])
REF ALT A1 A2
0 G A GA AA
1 C A CA CA
2 T C TT TC
3 T G TG TT
Upvotes: 0