genome_juice
genome_juice

Reputation: 29

pandas new column based on information in two other columns

I have a pandas data frame and I would like to make a new column based on two other columns. The data frame looks like this:

REF ALT 30_10
A   G   0:0
G   A   .:.
G   A   .:.
A   G   .:.
T   C   .:.
A   G   1:1
G   A   0:0
C   T   1:1
C   T   0:0
T   A   0:0
T   G   0:0
G   A   0:0
C   T   0:1

I would like a new column that is based on REF ALT and 30_10. The new column will be represented by comparing the 30_10 column to the REF and ALT columns, where 0 in the 30_10 column represents REF and 1 in the 30_10 represents ALT, and . represents a 0. The new column should look like this:

REF ALT 30_10 new_column
A   G   0:0  A:A
G   A   .:.  0:0
G   A   .:.  0:0
A   G   .:.  0:0
T   C   .:.  0:0 
A   G   1:1  G:G
G   A   0:0  G:G
C   T   1:1  T:T
C   T   0:0  C:C
T   A   0:0  T:T
T   G   0:0  T:T
G   A   0:0  G:G
C   T   0:1  C:T

Just wondered if there was a quick pandas or numpy method to do this? I can't seem to find what I need.

Thanks!

Upvotes: 1

Views: 2270

Answers (2)

fsimonjetz
fsimonjetz

Reputation: 5802

I noticed that you could access the REF and ALT columns by their indices 0 and 1, respectively, but that doesn't work with the dot '.'. But if we add a temporary column with zeros at position 2,

>>> df.insert(loc=2, column='tmp', value='0')
>>> df
  REF ALT tmp 30_10
0   A   G   0   0:0
1   G   A   0   .:.
2   G   A   0   .:.
3   A   G   0   .:.
4   T   C   0   .:.

and replace the '.' with 2, we can basically work with indices in all three cases, i.e., we can

  • split at ':' to get indices x and y,
  • get the content of the xth/yth column,
  • and rejoin.
>>> df['new_column'] = df.apply(lambda row:':'.join(row.iloc[int(i)] for i in row['30_10'].replace('.', '2').split(':')), axis=1)
>>> df = df.drop('tmp', axis=1) # drop the temporary column
>>> df
   REF ALT tmp 30_10 new_column
0    A   G   0   0:0        A:A
1    G   A   0   .:.        0:0
2    G   A   0   .:.        0:0
3    A   G   0   .:.        0:0
4    T   C   0   .:.        0:0
5    A   G   0   1:1        G:G
6    G   A   0   0:0        G:G
7    C   T   0   1:1        T:T
8    C   T   0   0:0        C:C
9    T   A   0   0:0        T:T
10   T   G   0   0:0        T:T
11   G   A   0   0:0        G:G
12   C   T   0   0:1        C:T

Upvotes: 1

bkeesey
bkeesey

Reputation: 496

I am sure the is a better way to do it other than a nested np.where statement but it should work in this given example.

import pandas as pd
import numpy as np
# Create dummy data
code = pd.Series(['0:1', '1:0', '0:1', '0:.', '.:1', '0:1'])
ref = pd.Series(['A', 'G', 'A', 'T', 'A', 'T'])
alt = pd.Series(['a', 'b', 'c', 'd', 'e', 'f'])

df = pd.DataFrame()

DUMMY DATA OUTPUT:

 code REF ALT
0  0:1   A   a
1  1:0   G   b
2  0:1   A   c
3  0:.   T   d
4  .:1   A   e
5  0:1   T   f

Nested np.where allows you to check for more than two conditions.

# Split the code string into two columns
df[['code_start', 'code_end']]  = df['code'].str.split(':', expand=True)

# nested np.where to assign letters based on code for beginning of code column.
df['new_start'] = np.where(df['code_start']=='0', df['REF'], (np.where(df['code_start']=='1', df['ALT'], '0')))

# nested np.where to assign letters based on code for endof code column.
df['new_end'] = np.where(df['code_end']=='0', df['REF'], (np.where(df['code_end']=='1', df['ALT'], '0')))

# Create new code column combining columns as string.
df['new_code'] = df['new_start']+":"+df['new_end']

FINAL OUTPUT:

  code REF ALT code_start code_end new_start new_end new_code
0  0:1   A   a          0        1         A       a      A:a
1  1:0   G   b          1        0         b       G      b:G
2  0:1   A   c          0        1         A       c      A:c
3  0:.   T   d          0        .         T       0      T:0
4  .:1   A   e          .        1         0       e      0:e
5  0:1   T   f          0        1         T       f      T:f

From here you can just drop the extra columns.

I am sure there is a more concise way of doing it, maybe using re.sub but this is one method.

Upvotes: 2

Related Questions