Reputation: 83
I have this data frame (columns are strings):
ORF ORFDesc
3 b1731 succinate-semialdehyde dehydrogenase
4 b234 succinate-semialdehyde dehydrogenase
24 b2780 L-alanine dehydrogenase
27 b753 methylmalmonate semialdehyde dehydrogenase
29 b1187 pyrroline-5-carboxylate dehydrogenase
...............................................................
1922 b1124 probable epoxide hydrolase
1923 b2214 probable epoxide hydrolase
1924 b3670 probable epoxide hydrolase
1925 b134 probable epoxide hydrolase
2382 b2579 1,3,4,6-tetrachloro-1,4-cyclohexadiene hydrolase
I need to get the 'ORF' values for rows whose 'ORFDesc' contains a word that includes "hydro" and is exactly 13 characters long. To be clear, the word itself must be 13 characters long, not the whole description.
I'm currently using
df['IDClass'][df['ORFDesc'].str.contains("hydro", na=False)]
to match the rows that contain "hydro", but I also need to reject the ones where the matching word's length is not 13.
I would like to use a regex so I can make a new column 'word' like:
ORF ORFDesc word
3 b1731 succinate-semialdehyde dehydrogenase dehydrogenase
4 b234 succinate-semialdehyde dehydrogenase dehydrogenase
24 b2780 L-alanine dehydrogenase dehydrogenase
27 b753 methylmalmonate semialdehyde dehydrogenase .
29 b1187 pyrroline-5-carboxylate dehydrogenase .
...............................................................
1922 b1124 probable epoxide hydrolase hydrolase
1923 b2214 probable epoxide hydrolase hydrolase
1924 b3670 probable epoxide hydrolase ....
1925 b134 probable epoxide hydrolase ..
2382 b2579 1,3,4,6-tetrachloro-1,4-cyclohexadiene hydrolase .
Then I would be able to discard rows based on the length of the value in the 'word' column.
What pattern should I use?
EDIT:
I have tried this, but it still doesn't work:
pattern = '\b(?=\w*hydro)\w+\b'
Upvotes: 3
Views: 1147
Reputation: 626689
You can use
\b(?=\w{13}\b)\w*hydro
See the regex demo
Details
\b - a word boundary
(?=\w{13}\b) - a positive lookahead that requires 13 word chars to be present immediately to the right of the current location followed with a word boundary
\w*hydro - zero or more word chars and then hydro.
Python code:
df['ORF'][df['ORFDesc'].str.contains(r"\b(?=\w{13}\b)\w*hydro", na=False)]
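Since the question also asks for a 'word' column, here is a minimal sketch of one way to build it with str.extract. The sample frame is a small subset of the question's data, and the capture group plus the trailing \w* are my additions to the pattern above (so the characters after "hydro" are captured too), not part of the original answer:

```python
import pandas as pd

# Small sample taken from the question's data (not the full frame)
df = pd.DataFrame(
    {
        "ORF": ["b1731", "b2780", "b1124", "b2579"],
        "ORFDesc": [
            "succinate-semialdehyde dehydrogenase",
            "L-alanine dehydrogenase",
            "probable epoxide hydrolase",
            "1,3,4,6-tetrachloro-1,4-cyclohexadiene hydrolase",
        ],
    },
    index=[3, 24, 1922, 2382],
)

# Assumption: wrap the word in a capture group and add \w* after "hydro"
# so that the whole 13-character word is extracted, not just its prefix.
pattern = r"\b(?=\w{13}\b)(\w*hydro\w*)"

# Rows without a matching word get NaN in the new column
df["word"] = df["ORFDesc"].str.extract(pattern, expand=False)

# Keep only the ORF values of rows where a 13-character "hydro" word was found
orfs = df.loc[df["word"].notna(), "ORF"]
print(orfs.tolist())
```

With this approach there is no need for a separate length check on 'word', because the lookahead already enforces the 13-character length.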
Upvotes: 2
Reputation: 214927
If you are looking for a boolean series that tells whether each row matches, you can use \b(?=\w{13}\b)(?=\w*hydro), which checks that the word is 13 characters long and contains the hydro pattern:
df.ORFDesc.str.contains(r'\b(?=\w{13}\b)(?=\w*hydro)')
#3 True
#4 True
#24 True
#27 True
#29 True
#1922 False
#1923 False
#1924 False
#1925 False
#2382 False
#Name: ORFDesc, dtype: bool
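One thing to keep in mind: this pattern consists only of zero-width assertions, so it matches an empty string at the start of the word. That is fine for str.contains, but it will not give you the word itself. A small sketch with the standard re module illustrates the difference (the second, consuming pattern with the trailing \w* is my own variation for comparison):

```python
import re

# Lookahead-only pattern from this answer: it consumes no characters
both_lookaheads = re.compile(r"\b(?=\w{13}\b)(?=\w*hydro)")

# Variation (an assumption for comparison): consume the whole word
consuming = re.compile(r"\b(?=\w{13}\b)\w*hydro\w*")

text = "L-alanine dehydrogenase"

zero_width = both_lookaheads.search(text)
full_word = consuming.search(text)

# The lookahead-only match is empty, but its position marks the word start
print(repr(zero_width.group()), zero_width.start())

# The consuming pattern returns the 13-character word itself
print(full_word.group())

# A 9-character word such as "hydrolase" is rejected by the length check
print(consuming.search("probable epoxide hydrolase"))
```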
Upvotes: 0