Reputation: 71
I have dataframe with a column called 'INFO' containing a string delimited by a semicolon like below. I would like to parse the string to return values corresponding to specific substrings and assign it to a new column.
EX. In the table below, I would want to assign the value (after the equals sign) for 'CLNDISDB' to a column of the same name.
I have tried:
df['INFO'].str.split(';',expand=True)
which assigned each string part to a new column but that would still require me to parse each column for the necessary string. Any help would be greatly appreciated.
I can only use python standard libraries
| | INFO |
|----|---------------------------------------------------|
| 0 | AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619;... |
| 1 | AF_ESP=0.00015;AF_EXAC=0.00010;ALLELEID=514926... |
| 2 | ALLELEID=181485;CLNDISDB=MedGen:C4015293,OMIM:... |
| 3 | ALLELEID=514896;CLNDISDB=MedGen:C4015293,OMIM:... |
| 4 | AF_ESP=0.00515;AF_EXAC=0.00831;AF_TGP=0.00339;... |
| 5 | AF_ESP=0.40158;AF_EXAC=0.37025;AF_TGP=0.33886;... |
| 6 | ALLELEID=556509;CLNDISDB=MedGen:C4015293,OMIM:... |
| 7 | ALLELEID=556512;CLNDISDB=MedGen:C4015293,OMIM:... |
| 8 | ALLELEID=171289;CLNDISDB=MedGen:C4015293,OMIM:... |
| 9 | AF_EXAC=0.00001;ALLELEID=171288;CLNDISDB=MedGe... |
| 10 | AF_ESP=0.00038;AF_EXAC=0.00036;AF_TGP=0.00060;... |
| 11 | AF_ESP=0.00987;AF_EXAC=0.00772;AF_TGP=0.01558;... |
| 12 | AF_ESP=0.00070;AF_TGP=0.00080;ALLELEID=446981;... |
| 13 | AF_EXAC=0.00038;ALLELEID=446941;CLNDISDB=MedGe... |
| 14 | AF_EXAC=0.00114;AF_TGP=0.00958;ALLELEID=364282... |
| 15 | ALLELEID=556516;CLNDISDB=MedGen:C3808739,OMIM:... |
| 16 | AF_EXAC=0.00024;ALLELEID=364148;CLNDISDB=MedGe... |
| 17 | ALLELEID=514900;CLNDISDB=MedGen:C3808739,OMIM:... |
| 18 | AF_EXAC=0.42418;AF_TGP=0.28255;ALLELEID=133759... |
| 19 | ALLELEID=364285;CLNDISDB=MedGen:CN169374;CLNDN... |
Upvotes: 1
Views: 774
Reputation: 2696
This was fun, so setting up the dataframe
import re
import pandas as pd
import numpy as np
test={'INFO':['AF_EXAC=0.42418;AF_TGP=-0.28255;ALLELEID=133759', 'foo','AF_EXAC=-0.42418;AF_TGP=0.28255;ALLELEID=133759']}
df=pd.DataFrame(test)
we get
INFO
0 AF_EXAC=0.42418;AF_TGP=-0.28255;ALLELEID=133759
1 foo
2 AF_EXAC=-0.42418;AF_TGP=0.28255;ALLELEID=133759
Now using a lambda function and a conditional we just extract the values for "AF_TGP". Setting up a new column, we extract the number from the regex pattern that surrounds the value. But we only run the regex when it will pass the text, using the in
function in python.
df['a'] = df['INFO'].apply(lambda x: re.search(r'AF_TGP=(?P<number>[+-]?\d+\.\d+);', str(x)).group('number') if 'AF_TGP' in str(x) else np.nan)
this gives the result
INFO a
0 AF_EXAC=0.42418;AF_TGP=-0.28255;ALLELEID=133759 -0.28255
1 foo NaN
2 AF_EXAC=-0.42418;AF_TGP=0.28255;ALLELEID=133759 0.28255
Upvotes: 1
Reputation: 71
Figured it out. Works well, have to clean up the results a bit and remove an extra space and comma.
def my_func(x):
return [value.split("CLNDISDB=",1) for value in x.split(';') if 'CLNDISDB' in value]
df['CLNDISDB'] = [my_func(x) for x in df['INFO'].values]
Upvotes: 0