Reputation: 27
I got the following csv file with sample data:
Now I want to replace the columns 'SIFT' and 'PolyPhen' values with the data inside the parentheses of these columns. So for row 1 the SIFT value will replace to 0.82, and for row 2 the SIFT value will be 0.85. Also I want the part before the parentheses, tolerated/deleterious, inside a new column named 'SIFT_prediction'.
This is what I tried so far:
import pandas as pd
import re
testfile = 'test_sift_columns.csv'
df = pd.read_csv(testfile)
df['SIFT'].re.search(r'\((.*?)\)',s).group(1)
This code will take everything inside the parentheses of the column SIFT. But this does not replace anything. I probably need a for loop to read and replace every row but I don't know how to do it correctly. Also I am not sure if using a regular expression is necessary with pandas. Maybe there is a smarter way to resolve my problem.
Upvotes: 0
Views: 111
Reputation: 863166
Use Series.str.extract
:
df = pd.DataFrame({'SIFT':['tol(0.82)','tol(0.85)','tol(1.42)'],
'PolyPhen':['beg(0)','beg(0)','beg(0)']})
pat = r'(.*?)\((.*?)\)'
df[['SIFT_prediction','SIFT']] = df['SIFT'].str.extract(pat)
df[['PolyPhen_prediction','PolyPhen']] = df['PolyPhen'].str.extract(pat)
print(df)
SIFT_prediction SIFT PolyPhen_prediction PolyPhen
0 tol 0.82 beg 0
1 tol 0.85 beg 0
2 tol 1.42 beg 0
Alternative:
df[['SIFT_prediction','SIFT']] = df['SIFT'].str.rstrip(')').str.split('(', expand=True)
df[['PolyPhen_prediction','PolyPhen']] = df['PolyPhen'].str.rstrip(')').str.split('(', expand=True)
Upvotes: 2
Reputation: 2026
You can do something like replacing all alphanumeric values with empty strings in order to get the float value and the opposite to get the predicion.
import pandas as pd
df = pd.DataFrame({'ID': [1,2,3,4], 'SIFT': ['tolerated(0.82)', 'tolerated(0.85)', 'tolerated(0.25)', 'dedicated(0.5)']})
df['SIFT_formatted'] = df.SIFT.str.replace('[^0-9.]', '', regex=True).astype(float)
df['SIFT_prediction'] = df.SIFT.str.replace('[^a-zA-Z]', '', regex=True)
df
Would give you:
ID SIFT SIFT_formatted SIFT_prediction
0 1 tolerated(0.82) 0.82 tolerated
1 2 tolerated(0.85) 0.85 tolerated
2 3 tolerated(0.25) 0.25 tolerated
3 4 dedicated(0.5) 0.50 dedicated
Upvotes: 0