Reputation: 29
I have a pandas data frame and I would like to make a new column based on two other columns. The data frame looks like this:
REF ALT 30_10 A G 0:0 G A .:. G A .:. A G .:. T C .:. A G 1:1 G A 0:0 C T 1:1 C T 0:0 T A 0:0 T G 0:0 G A 0:0 C T 0:1
I would like a new column that is based on REF ALT and 30_10. The new column will be represented by comparing the 30_10 column to the REF and ALT columns, where 0 in the 30_10 column represents REF and 1 in the 30_10 represents ALT, and . represents a 0. The new column should look like this:
REF ALT 30_10 new_column A G 0:0 A:A G A .:. 0:0 G A .:. 0:0 A G .:. 0:0 T C .:. 0:0 A G 1:1 G:G G A 0:0 G:G C T 1:1 T:T C T 0:0 C:C T A 0:0 T:T T G 0:0 T:T G A 0:0 G:G C T 0:1 C:T
Just wondered if there was a quick pandas or numpy method to do this? I can't seem to find what I need.
Thanks!
Upvotes: 1
Views: 2270
Reputation: 5802
I noticed that you could access the REF
and ALT
columns by their indices 0 and 1, respectively, but that doesn't work with the dot '.'. But if we add a temporary column with zeros at position 2
,
>>> df.insert(loc=2, column='tmp', value='0')
>>> df
REF ALT tmp 30_10
0 A G 0 0:0
1 G A 0 .:.
2 G A 0 .:.
3 A G 0 .:.
4 T C 0 .:.
and replace the '.' with 2
, we can basically work with indices in all three cases, i.e., we can
x
and y
,>>> df['new_column'] = df.apply(lambda row:':'.join(row.iloc[int(i)] for i in row['30_10'].replace('.', '2').split(':')), axis=1)
>>> df = df.drop('tmp', axis=1) # drop the temporary column
>>> df
REF ALT tmp 30_10 new_column
0 A G 0 0:0 A:A
1 G A 0 .:. 0:0
2 G A 0 .:. 0:0
3 A G 0 .:. 0:0
4 T C 0 .:. 0:0
5 A G 0 1:1 G:G
6 G A 0 0:0 G:G
7 C T 0 1:1 T:T
8 C T 0 0:0 C:C
9 T A 0 0:0 T:T
10 T G 0 0:0 T:T
11 G A 0 0:0 G:G
12 C T 0 0:1 C:T
Upvotes: 1
Reputation: 496
I am sure the is a better way to do it other than a nested np.where
statement but it should work in this given example.
import pandas as pd
import numpy as np
# Create dummy data
code = pd.Series(['0:1', '1:0', '0:1', '0:.', '.:1', '0:1'])
ref = pd.Series(['A', 'G', 'A', 'T', 'A', 'T'])
alt = pd.Series(['a', 'b', 'c', 'd', 'e', 'f'])
df = pd.DataFrame()
DUMMY DATA OUTPUT:
code REF ALT
0 0:1 A a
1 1:0 G b
2 0:1 A c
3 0:. T d
4 .:1 A e
5 0:1 T f
Nested np.where
allows you to check for more than two conditions.
# Split the code string into two columns
df[['code_start', 'code_end']] = df['code'].str.split(':', expand=True)
# nested np.where to assign letters based on code for beginning of code column.
df['new_start'] = np.where(df['code_start']=='0', df['REF'], (np.where(df['code_start']=='1', df['ALT'], '0')))
# nested np.where to assign letters based on code for endof code column.
df['new_end'] = np.where(df['code_end']=='0', df['REF'], (np.where(df['code_end']=='1', df['ALT'], '0')))
# Create new code column combining columns as string.
df['new_code'] = df['new_start']+":"+df['new_end']
FINAL OUTPUT:
code REF ALT code_start code_end new_start new_end new_code
0 0:1 A a 0 1 A a A:a
1 1:0 G b 1 0 b G b:G
2 0:1 A c 0 1 A c A:c
3 0:. T d 0 . T 0 T:0
4 .:1 A e . 1 0 e 0:e
5 0:1 T f 0 1 T f T:f
From here you can just drop the extra columns.
I am sure there is a more concise way of doing it, maybe using re.sub
but this is one method.
Upvotes: 2