I am trying to select rows by searching a column for keywords. I have a tab-separated database.txt file like this (the exact contents are not very important):
Source Reference_id Method Evaluation EC value Enzyme RxnString Reaction Kprime Temp pH
ho 07LIN/ALG s A 1.1.1.87 homoisocitrate dehydrogenase C05662 + C00003 = C00322 + C00288 + C00004 "(1R,2S)-1-hydroxybutane-1,2,4-tricarboxylate(aq) + NAD(ox) = 2-oxoadipate(aq) + carbon dioxide(aq) + NAD(red)" 0.45 298.15 7.5
as 63GRE s C 3.5.4.9 methenyltetrahydrofolate cyclohydrolase C00445 + C00001 = C00234 "5,10-methenyltetrahydrofolate(aq) + H2O(l) = 10-formyltetrahydrofolate(aq)" 4.2 298.15 6.5
H/DEN_1165 67ENG/DEN s B 4.2.1.3 aconitate hydratase C00311 = C00158 isocit(aq) = cit(aq) 18 310.15 7.3 2.96
G/DEN_1165 67ENG/DEN s B 4.2.1.3 aconitate hydratase C00311 = C00158 isocit(aq) = cit(aq) 25 310.15 7.3 2.8
p C00002 + C00085 = C00008 + C00354 "ATP(aq) + D-p6p(aq) = ADP(aq) + D-fructose 1,6-bisphosphate(aq)" 2900 303.15 8
H/OEH_1552 99HUT/OEH calorimetry B 3.2.1.26 #NAME? C00089 + C00001 = C00031 + C00095 sucrose(aq) + H2O(l) = D-glucose(aq) + D-fructose(aq) 298.15 4.6
The code is:
import pandas as pd
import numpy as np
data = pd.read_csv('database.txt', sep="\t")
data
This works.
Then I need the rows where the keywords ATP, ADP or AMP appear in the "Reaction" column, so I tried:
data.loc[data['Reaction'].isin(['ATP(aq)','ADP(aq)', 'AMP(aq)'])]
but I get:
KeyError Traceback (most recent call last)
/usr/lib/python3/dist-packages/pandas/core/indexing.py in _has_valid_type(self, key, axis)
1410 if key not in ax:
-> 1411 error()
1412 except TypeError as e:
Any help with this problem?
Upvotes: 1
Reputation: 862611
I believe you need str.contains to check for substrings:
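import re
# Keywords to search for inside the Reaction column
L = ['ATP(aq)','ADP(aq)','AMP(aq)']
# Escape regex metacharacters such as ( and ), then join the alternatives with |
pat = '|'.join(re.escape(c) for c in L)
# Keep the rows whose Reaction text contains any of the keywords
df = data.loc[data['Reaction'].str.contains(pat)]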
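To illustrate the difference, here is a minimal sketch on a small made-up Series (the name s and its contents are assumptions for illustration, not the real database.txt). isin compares each whole cell against the keywords, while str.contains searches inside each string. re.escape matters because, unescaped, the pattern 'ATP(aq)' treats the parentheses as a regex group and looks for the literal text 'ATPaq'. Passing na=False is an extra precaution for missing values and was not part of the code above.

```python
import re
import pandas as pd

# Hypothetical sample data, only for illustration
s = pd.Series([
    'ATP(aq) + D-p6p(aq) = ADP(aq) + D-fructose 1,6-bisphosphate(aq)',
    'isocitrate(aq) = citrate(aq)',
    None,  # a missing Reaction value
])
keywords = ['ATP(aq)', 'ADP(aq)', 'AMP(aq)']

# isin tests whether the WHOLE cell equals one of the keywords
whole_cell = s.isin(keywords)

# str.contains searches inside each string; escape ( and ) first
pat = '|'.join(re.escape(k) for k in keywords)
# na=False makes missing values count as "no match" instead of NaN
substring = s.str.contains(pat, na=False)

print(whole_cell.tolist())
print(substring.tolist())
```

In this sketch isin selects no rows at all, whereas str.contains selects only the first one.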
EDIT:
If you need to remove duplicates by the Reaction column, add drop_duplicates:
data = pd.DataFrame({'Reaction': {0: 'ATP(aq) + D-p6p(aq) = ADP(aq) + D-fructose 1,6-bisphosphate(aq)',
1: '5,10-methenyltetrahydrofolate(aq) + H2O(l) = 10-formyltetrahydrofolate(aq)',
2: 'isocitrate(aq) = citrate(aq)',
3: 'ATP(aq) + D-p6p(aq) = ADP(aq) + D-fructose 1,6-bisphosphate(aq)'}})
print (data)
Reaction
0 ATP(aq) + D-p6p(aq) = ADP(aq) + D-fructose 1,6...
1 5,10-methenyltetrahydrofolate(aq) + H2O(l) = 1...
2 isocitrate(aq) = citrate(aq)
3 ATP(aq) + D-p6p(aq) = ADP(aq) + D-fructose 1,6...
import re
L = ['ATP(aq)','ADP(aq)','AMP(aq)']
pat = '|'.join(['{}'.format(re.escape(c)) for c in L])
df1 = data.loc[data['Reaction'].str.contains(pat)]
print (df1)
Reaction
0 ATP(aq) + D-p6p(aq) = ADP(aq) + D-fructose 1,6...
3 ATP(aq) + D-p6p(aq) = ADP(aq) + D-fructose 1,6...
df2 = data.loc[data['Reaction'].str.contains(pat)].drop_duplicates('Reaction')
print (df2)
Reaction
0 ATP(aq) + D-p6p(aq) = ADP(aq) + D-fructose 1,6...
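drop_duplicates keeps the first occurrence by default, which is why the row with index 0 survives above. A minimal sketch on a made-up frame (the names df, first and last and the values are assumptions for illustration) shows the keep parameter and reset_index, in case the last occurrence or a fresh index is wanted instead:

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 share the same Reaction
df = pd.DataFrame({'Reaction': ['A = B', 'C = D', 'A = B'],
                   'pH': [7.5, 6.5, 8.0]})

# Default keep='first': the earliest row of each duplicate group is kept
first = df.drop_duplicates('Reaction')

# keep='last' keeps the latest row; reset_index(drop=True) renumbers from 0
last = df.drop_duplicates('Reaction', keep='last').reset_index(drop=True)

print(first)
print(last)
```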
Upvotes: 1