Thony

Reputation: 83

Python Pandas Extract word from column that contains String with Regex

I have this data frame (columns are strings):

        ORF                                             ORFDesc
3     b1731              succinate-semialdehyde dehydrogenase
4      b234              succinate-semialdehyde dehydrogenase
24    b2780                             L-alanine dehydrogenase
27     b753          methylmalmonate semialdehyde dehydrogenase
29    b1187               pyrroline-5-carboxylate dehydrogenase
...............................................................                                               
1922  b1124                         probable epoxide hydrolase 
1923  b2214                         probable epoxide hydrolase 
1924  b3670                          probable epoxide hydrolase
1925   b134                          probable epoxide hydrolase
2382  b2579    1,3,4,6-tetrachloro-1,4-cyclohexadiene hydrolase

I need to get the 'ORF' values for rows whose 'ORFDesc' contains a word that includes "hydro" and is exactly 13 characters long. To be clear, the word itself must be 13 characters long, not the whole description.

I'm using

df['IDClass'][df['ORFDesc'].str.contains("hydro", na=False)]

This matches the rows that contain "hydro", but I need to reject those where the matching word is not 13 characters long.

I would like to use a regex so I can create a new column 'word', like this:

        ORF                                             ORFDesc           word
3     b1731                succinate-semialdehyde dehydrogenase  dehydrogenase
4      b234                succinate-semialdehyde dehydrogenase  dehydrogenase
24    b2780                             L-alanine dehydrogenase  dehydrogenase
27     b753          methylmalmonate semialdehyde dehydrogenase            ...
29    b1187               pyrroline-5-carboxylate dehydrogenase            ...
...............................................................
1922  b1124                          probable epoxide hydrolase      hydrolase
1923  b2214                          probable epoxide hydrolase      hydrolase
1924  b3670                          probable epoxide hydrolase            ...
1925   b134                          probable epoxide hydrolase            ...
2382  b2579    1,3,4,6-tetrachloro-1,4-cyclohexadiene hydrolase            ...

And then be able to discard rows by using length in 'word' column.

What pattern should I use?

EDIT:

I have tried this, but it still doesn't work:

pattern = '\b(?=\w*hydro)\w+\b'

Upvotes: 3

Views: 1147

Answers (2)

Wiktor Stribiżew

Reputation: 626689

You can use

\b(?=\w{13}\b)\w*hydro

Details

  • \b - a word boundary
  • (?=\w{13}\b) - a positive lookahead that requires exactly 13 word chars immediately to the right of the current location, followed by a word boundary
  • \w*hydro - zero or more word chars and then hydro.
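
As a rough illustration of how the pieces above combine, here is a minimal sketch using only Python's standard `re` module. The helper name `find_hydro_word` and the trailing `\w*` are assumptions added for this example (they are not part of the pattern above); the trailing `\w*` only makes the match cover the whole word rather than stop right after "hydro".

```python
import re

# Pattern from the answer, with a trailing \w* added (an assumption of this
# sketch) so the match spans the whole word instead of ending after "hydro".
PATTERN = re.compile(r"\b(?=\w{13}\b)\w*hydro\w*")


def find_hydro_word(description):
    """Return the first 13-character word containing 'hydro', or None."""
    match = PATTERN.search(description)
    return match.group(0) if match else None


# "dehydrogenase" has 13 characters, "hydrolase" has only 9.
print(find_hydro_word("succinate-semialdehyde dehydrogenase"))
print(find_hydro_word("probable epoxide hydrolase"))
```

Note the raw string prefix `r"..."`: without it, Python reads `\b` as a backspace character rather than a word boundary.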

Python code:

df['ORF'][df['ORFDesc'].str.contains(r"\b(?=\w{13}\b)\w*hydro", na=False)]
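
If a separate 'word' column is wanted, as the question describes, one possible sketch is to wrap the word in a capture group and use `Series.str.extract`. The small frame below is made-up sample data copied from a few rows of the question, and the trailing `\w*` inside the group is an addition for this example so the whole word is captured.

```python
import pandas as pd

# Sample rows taken from the question; the real frame is assumed to look similar.
df = pd.DataFrame(
    {
        "ORF": ["b1731", "b2780", "b1124", "b2579"],
        "ORFDesc": [
            "succinate-semialdehyde dehydrogenase",
            "L-alanine dehydrogenase",
            "probable epoxide hydrolase",
            "1,3,4,6-tetrachloro-1,4-cyclohexadiene hydrolase",
        ],
    },
    index=[3, 24, 1922, 2382],
)

# One capture group with expand=False returns a Series; rows without a
# 13-character "hydro" word get NaN.
df["word"] = df["ORFDesc"].str.extract(r"\b(?=\w{13}\b)(\w*hydro\w*)", expand=False)

# Keep only the ORF values of rows where such a word was found.
orf_values = df.loc[df["word"].notna(), "ORF"]
print(orf_values)
```

Because the length check is already in the lookahead, no extra filtering on the length of 'word' is needed afterwards.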

Upvotes: 2

akuiper

Reputation: 214927

If you are looking for a boolean Series that tells whether each row matches, you can use \b(?=\w{13}\b)(?=\w*hydro), which checks whether a word is 13 characters long and contains the hydro pattern:

df.ORFDesc.str.contains(r'\b(?=\w{13}\b)(?=\w*hydro)')

#3        True
#4        True
#24       True
#27       True
#29       True
#1922    False
#1923    False
#1924    False
#1925    False
#2382    False
#Name: ORFDesc, dtype: bool
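
A boolean Series like the one above can then serve as a row mask. The following is a small sketch, again on made-up sample rows from the question, that selects the 'ORF' values with `.loc`; `na=False` is added here on the assumption that missing descriptions should count as non-matches.

```python
import pandas as pd

# Sample rows taken from the question; the real frame is assumed to look similar.
df = pd.DataFrame(
    {
        "ORF": ["b1731", "b753", "b1124", "b2579"],
        "ORFDesc": [
            "succinate-semialdehyde dehydrogenase",
            "methylmalmonate semialdehyde dehydrogenase",
            "probable epoxide hydrolase",
            "1,3,4,6-tetrachloro-1,4-cyclohexadiene hydrolase",
        ],
    },
    index=[3, 27, 1922, 2382],
)

# Both lookaheads are non-capturing, so str.contains only tests for a match.
mask = df["ORFDesc"].str.contains(r"\b(?=\w{13}\b)(?=\w*hydro)", na=False)

# Select the ORF column for the matching rows in a single .loc call.
selected = df.loc[mask, "ORF"]
print(selected)
```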

Upvotes: 0
