user7437554

Iterating over elements in pandas dataframe-Error

I am trying to select the rows whose text in a given column contains certain keywords. My data is in a tab-separated database.txt file like this:

Source  Reference_id    Method  Evaluation  EC value    Enzyme  RxnString   Reaction        Kprime  Temp        pH  
ho  07LIN/ALG   s   A   1.1.1.87    homoisocitrate dehydrogenase    C05662 + C00003 = C00322 + C00288 + C00004  "(1R,2S)-1-hydroxybutane-1,2,4-tricarboxylate(aq) + NAD(ox) = 2-oxoadipate(aq) + carbon dioxide(aq) + NAD(red)"     0.45    298.15      7.5 
as  63GRE   s   C   3.5.4.9 methenyltetrahydrofolate cyclohydrolase C00445 + C00001 = C00234    "5,10-methenyltetrahydrofolate(aq) + H2O(l) = 10-formyltetrahydrofolate(aq)"        4.2 298.15      6.5 
H/DEN_1165  67ENG/DEN   s   B   4.2.1.3 aconitate hydratase C00311 = C00158 isocit(aq) = cit(aq)        18  310.15      7.3 2.96
G/DEN_1165  67ENG/DEN   s   B   4.2.1.3 aconitate hydratase C00311 = C00158 isocit(aq) = cit(aq)        25  310.15      7.3 2.8
p C00002 + C00085 = C00008 + C00354 "ATP(aq) + D-p6p(aq) = ADP(aq) + D-fructose 1,6-bisphosphate(aq)"       2900    303.15      8   
H/OEH_1552  99HUT/OEH   calorimetry B   3.2.1.26    #NAME?  C00089 + C00001 = C00031 + C00095   sucrose(aq) + H2O(l) = D-glucose(aq) + D-fructose(aq)           298.15      4.6 

The code is:

import pandas as pd
import numpy as np
data = pd.read_csv('database.txt', sep="\t")
data 

This works.

Then I need the rows whose "Reaction" column contains the keywords ATP, ADP or AMP, so I tried:

data.loc[data['Reaction'].isin(['ATP(aq)','ADP(aq)', 'AMP(aq)'])]

but I get:


KeyError                                  Traceback (most recent call last)
/usr/lib/python3/dist-packages/pandas/core/indexing.py in _has_valid_type(self, key, axis)
   1410                 if key not in ax:
-> 1411                     error()
   1412             except TypeError as e:

Any help with this problem?

Upvotes: 1

Views: 49

Answers (1)

jezrael

Reputation: 862611

I believe you need str.contains to check for substrings, because isin only matches whole cell values:

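import re

# Escape each keyword so "(" and ")" are matched literally, then join with "|" (regex OR)
L = ['ATP(aq)','ADP(aq)','AMP(aq)']
pat = '|'.join(re.escape(c) for c in L)

# Keep the rows whose Reaction text contains any of the keywords
df = data.loc[data['Reaction'].str.contains(pat)]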
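
To make the difference between isin and str.contains concrete, here is a minimal sketch on toy data. The three-element Series and the names reactions, keywords, exact_mask and substring_mask are assumptions for illustration, not part of the original file. The na=False argument is an optional addition that treats missing Reaction values as non-matches, which may be useful if some rows of the real file have no reaction text.

```python
import re
import pandas as pd

# Toy data (assumed for illustration); the None entry stands in for a missing Reaction.
reactions = pd.Series([
    'ATP(aq) + D-p6p(aq) = ADP(aq) + D-fructose 1,6-bisphosphate(aq)',
    'isocit(aq) = cit(aq)',
    None,
])

keywords = ['ATP(aq)', 'ADP(aq)', 'AMP(aq)']

# isin compares each whole cell against the list, so a cell that merely
# contains "ATP(aq)" somewhere inside a longer string does not match.
exact_mask = reactions.isin(keywords)

# str.contains searches inside each cell. re.escape keeps "(" and ")" literal,
# and na=False makes missing values count as non-matches.
pattern = '|'.join(re.escape(k) for k in keywords)
substring_mask = reactions.str.contains(pattern, na=False)

print(exact_mask.tolist())
print(substring_mask.tolist())
```

Only the first row is selected by the substring mask, while the exact-match mask selects nothing.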

EDIT:

If you need to remove duplicates by the Reaction column, add drop_duplicates:

data = pd.DataFrame({'Reaction': {0: 'ATP(aq) + D-p6p(aq) = ADP(aq) + D-fructose 1,6-bisphosphate(aq)', 
                                  1: '5,10-methenyltetrahydrofolate(aq) + H2O(l) = 10-formyltetrahydrofolate(aq)',
                                  2: 'isocitrate(aq) = citrate(aq)', 
                                  3: 'ATP(aq) + D-p6p(aq) = ADP(aq) + D-fructose 1,6-bisphosphate(aq)'}})
print (data)
                                            Reaction
0  ATP(aq) + D-p6p(aq) = ADP(aq) + D-fructose 1,6...
1  5,10-methenyltetrahydrofolate(aq) + H2O(l) = 1...
2                       isocitrate(aq) = citrate(aq)
3  ATP(aq) + D-p6p(aq) = ADP(aq) + D-fructose 1,6...

import re

L = ['ATP(aq)','ADP(aq)','AMP(aq)']
pat = '|'.join(re.escape(c) for c in L)
df1 = data.loc[data['Reaction'].str.contains(pat)]
print (df1)
                                            Reaction
0  ATP(aq) + D-p6p(aq) = ADP(aq) + D-fructose 1,6...
3  ATP(aq) + D-p6p(aq) = ADP(aq) + D-fructose 1,6...

df2 = data.loc[data['Reaction'].str.contains(pat)].drop_duplicates('Reaction')
print (df2)
                                            Reaction
0  ATP(aq) + D-p6p(aq) = ADP(aq) + D-fructose 1,6...
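
If this filter-then-deduplicate step is needed in more than one place, it could be wrapped in a small helper. The sketch below is only one possible way to do that: the function name filter_by_keywords and the variable names sample and result are hypothetical, and na=False is an added assumption so that rows with a missing value are skipped instead of producing NaN in the mask.

```python
import re
import pandas as pd

def filter_by_keywords(df, column, keywords):
    """Return rows whose `column` contains any keyword, one row per distinct value."""
    # Escape the keywords so regex metacharacters such as "(" are matched literally.
    pattern = '|'.join(re.escape(k) for k in keywords)
    # na=False (an assumption here) treats missing values as non-matches.
    mask = df[column].str.contains(pattern, na=False)
    # drop_duplicates keeps the first occurrence of each distinct value by default.
    return df.loc[mask].drop_duplicates(column)

# Same sample data as in the example above.
sample = pd.DataFrame({'Reaction': [
    'ATP(aq) + D-p6p(aq) = ADP(aq) + D-fructose 1,6-bisphosphate(aq)',
    '5,10-methenyltetrahydrofolate(aq) + H2O(l) = 10-formyltetrahydrofolate(aq)',
    'isocitrate(aq) = citrate(aq)',
    'ATP(aq) + D-p6p(aq) = ADP(aq) + D-fructose 1,6-bisphosphate(aq)',
]})

result = filter_by_keywords(sample, 'Reaction', ['ATP(aq)', 'ADP(aq)', 'AMP(aq)'])
print(result.index.tolist())
```

On the sample data this keeps only the row with index 0, matching df2 above.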

Upvotes: 1
