Yoni
Yoni

Reputation: 27

Replace the value inside a csv column by value inside parentheses of the same column using python pandas

I got the following csv file with sample data: Small part of the csv file with sample data

Now I want to replace the columns 'SIFT' and 'PolyPhen' values with the data inside the parentheses of these columns. So for row 1 the SIFT value will replace to 0.82, and for row 2 the SIFT value will be 0.85. Also I want the part before the parentheses, tolerated/deleterious, inside a new column named 'SIFT_prediction'.

This is what I tried so far:

import pandas as pd
import re

testfile = 'test_sift_columns.csv'
df = pd.read_csv(testfile)  
df['SIFT'].re.search(r'\((.*?)\)',s).group(1)

This code will take everything inside the parentheses of the column SIFT. But this does not replace anything. I probably need a for loop to read and replace every row but I don't know how to do it correctly. Also I am not sure if using a regular expression is necessary with pandas. Maybe there is a smarter way to resolve my problem.

Upvotes: 0

Views: 111

Answers (2)

jezrael
jezrael

Reputation: 863166

Use Series.str.extract:

df = pd.DataFrame({'SIFT':['tol(0.82)','tol(0.85)','tol(1.42)'],
                   'PolyPhen':['beg(0)','beg(0)','beg(0)']})

pat = r'(.*?)\((.*?)\)'
df[['SIFT_prediction','SIFT']] = df['SIFT'].str.extract(pat)
df[['PolyPhen_prediction','PolyPhen']] = df['PolyPhen'].str.extract(pat)

print(df)
  SIFT_prediction  SIFT PolyPhen_prediction PolyPhen
0             tol  0.82                 beg        0
1             tol  0.85                 beg        0
2             tol  1.42                 beg        0

Alternative:

df[['SIFT_prediction','SIFT']] = df['SIFT'].str.rstrip(')').str.split('(', expand=True)
df[['PolyPhen_prediction','PolyPhen']] = df['PolyPhen'].str.rstrip(')').str.split('(', expand=True)

Upvotes: 2

VnC
VnC

Reputation: 2026

You can do something like replacing all alphanumeric values with empty strings in order to get the float value and the opposite to get the predicion.

import pandas as pd

df = pd.DataFrame({'ID': [1,2,3,4], 'SIFT': ['tolerated(0.82)', 'tolerated(0.85)', 'tolerated(0.25)', 'dedicated(0.5)']})
df['SIFT_formatted'] = df.SIFT.str.replace('[^0-9.]', '', regex=True).astype(float)
df['SIFT_prediction'] = df.SIFT.str.replace('[^a-zA-Z]', '', regex=True)
df

Would give you:

    ID  SIFT            SIFT_formatted  SIFT_prediction
0   1   tolerated(0.82) 0.82             tolerated
1   2   tolerated(0.85) 0.85             tolerated
2   3   tolerated(0.25) 0.25             tolerated
3   4   dedicated(0.5)  0.50             dedicated

Upvotes: 0

Related Questions