Reputation: 1815
Input data:
name Age Zodiac Grade City pahun
0 /extract 30 Aries A Aura a_b_c
1 /abc/236466/touchbar.html 20 Leo AB Somerville c_d_e
2 Brenda4 25 Virgo B Hendersonville f_g
3 /abc/256476/mouse.html 18 Libra AA Gannon h_i_j
I am trying to extract rows based on a regex applied to the name column. The regex should match numbers that are 6 digits long.
For example:
/abc/236466/touchbar.html - 236466
Here is the code I have used
df=df[df['name'].str.match(r'\d{6}') == True]
The above line is not matching at all.
Expected:
name Age Zodiac Grade City pahun
0 /abc/236466/touchbar.html 20 Leo AB Somerville c_d_e
1 /abc/256476/mouse.html 18 Libra AA Gannon h_i_j
Can anyone tell me what I am doing wrong?
Upvotes: 3
Views: 8361
Reputation: 626794
str.match only searches for a match at the start of the string. So, if you want to match / + 6 digits + / somewhere inside the string using str.match, you would need to use one of:
df=df[df['name'].str.match(r'.*/\d{6}/')] # greedy: best when the match is closer to the end of the string
df=df[df['name'].str.match(r'(?s).*/\d{6}/')] # same, but . also matches line breaks
df=df[df['name'].str.match(r'.*?/\d{6}/')] # lazy: best when the match is closer to the start of the string
df=df[df['name'].str.match(r'(?s).*?/\d{6}/')] # same, but . also matches line breaks
However, it is more reasonable and efficient here to use str.contains with a regex like
df=df[df['name'].str.contains(r'/\d{6}/')]
to find entries containing / + 6 digits + /.
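As a quick check of the difference, here is a minimal sketch that rebuilds part of the question's data (only the name and Age columns, which is an assumption made to keep it short) and compares the two methods. The variable names anchored, anywhere and result are illustrative only.

```python
import pandas as pd

# Sample data rebuilt from the question (only two of the columns, for brevity).
df = pd.DataFrame({
    "name": ["/extract", "/abc/236466/touchbar.html", "Brenda4", "/abc/256476/mouse.html"],
    "Age": [30, 20, 25, 18],
})

# str.match anchors the pattern at the start of each string,
# and no name starts with six digits, so every entry is False.
anchored = df["name"].str.match(r"\d{6}")

# str.contains searches anywhere inside each string.
anywhere = df["name"].str.contains(r"/\d{6}/")

# Keep the matching rows and renumber the index from 0, as in the expected output.
result = df[anywhere].reset_index(drop=True)
print(result)
```

The reset_index(drop=True) call is only there to reproduce the 0, 1 index shown in the expected output; drop it if the original row labels should be kept.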
Or, to make sure you just match 6 digit chunks and not 7+ digit chunks:
df=df[df['name'].str.contains(r'(?<!\d)\d{6}(?!\d)')]
where
(?<!\d) - makes sure there is no digit on the left
\d{6} - any six digits
(?!\d) - no digit on the right is allowed.
Upvotes: 5
Reputation: 21709
You are almost there; use str.contains instead:
df[df['name'].str.contains(r'\d{6,}')]
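Note that \d{6,} also accepts runs of seven or more digits. The sketch below compares it with the lookaround pattern from the other answer; the sample names are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical names: one 6-digit chunk, one 7-digit chunk, one with a single digit.
names = pd.Series(["/abc/236466/a.html", "/abc/1234567/b.html", "Brenda4"])

# Six or more consecutive digits: the 7-digit entry matches as well.
six_or_more = names.str.contains(r"\d{6,}")

# Exactly six digits, with no digit directly before or after the chunk.
exactly_six = names.str.contains(r"(?<!\d)\d{6}(?!\d)")

print(pd.DataFrame({"name": names, "six_or_more": six_or_more, "exactly_six": exactly_six}))
```

Which pattern is appropriate depends on whether longer numbers should count as a match.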
Upvotes: 0