merkle

Reputation: 1815

regular expression using pandas string match

Input data:

                        name  Age Zodiac Grade            City  pahun
0                   /extract   30  Aries     A            Aura  a_b_c
1  /abc/236466/touchbar.html   20    Leo    AB      Somerville  c_d_e
2                    Brenda4   25  Virgo     B  Hendersonville    f_g
3     /abc/256476/mouse.html   18  Libra    AA          Gannon  h_i_j

I am trying to select rows based on a regex applied to the name column. The regex should match a number that is six digits long.

For example:
/abc/236466/touchbar.html  - 236466

Here is the code I have used

df=df[df['name'].str.match(r'\d{6}') == True]

The above line is not matching at all.

Expected:

                         name  Age Zodiac Grade            City  pahun
0  /abc/236466/touchbar.html   20    Leo    AB      Somerville  c_d_e
1     /abc/256476/mouse.html   18  Libra    AA          Gannon  h_i_j

Can anyone tell me what I am doing wrong?

Upvotes: 3

Views: 8361

Answers (2)

Wiktor Stribiżew

Reputation: 626794

str.match only searches for a match at the start of the string. So, if you want to match / + 6 digits + / somewhere inside the string using str.match, you would need to use one of

df=df[df['name'].str.match(r'.*/\d{6}/')]      # assuming the match is closer to the end of the string
df=df[df['name'].str.match(r'(?s).*/\d{6}/')]  # same, but . also matches line breaks (multiline strings)
df=df[df['name'].str.match(r'.*?/\d{6}/')]     # assuming the match is closer to the start of the string
df=df[df['name'].str.match(r'(?s).*?/\d{6}/')] # same, but . also matches line breaks (multiline strings)
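
To see why the anchoring matters, here is a minimal sketch using the standard re module, on the assumption that str.match behaves like re.match and str.contains like re.search (which is how the pandas documentation describes them). The names list simply copies the name column from the question.

```python
import re

# The four values of the "name" column from the question.
names = ["/extract", "/abc/236466/touchbar.html", "Brenda4", "/abc/256476/mouse.html"]

# re.match only tries the pattern at position 0, so a bare \d{6} never matches:
# none of the names starts with a digit.
anchored = [bool(re.match(r"\d{6}", n)) for n in names]

# Prefixing the pattern with .* lets the match begin at position 0 and still
# reach the digits further along in the string.
prefixed = [bool(re.match(r".*/\d{6}/", n)) for n in names]

# re.search scans the whole string, so no prefix is needed.
searched = [bool(re.search(r"/\d{6}/", n)) for n in names]

print(anchored)  # [False, False, False, False]
print(prefixed)  # [False, True, False, True]
print(searched)  # [False, True, False, True]
```

The first list mirrors what the original str.match(r'\d{6}') call returns, which is why no rows were selected.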

However, it is more reasonable and efficient here to use str.contains with a regex like

df=df[df['name'].str.contains(r'/\d{6}/')]

to find entries containing / + 6 digits + /.
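
As a rough end-to-end check, the sketch below rebuilds the DataFrame from the question and applies the str.contains filter. The reset_index call is my own addition, included only so that the index matches the 0, 1 shown in the expected output.

```python
import pandas as pd

# Rebuild the input data shown in the question.
df = pd.DataFrame({
    "name": ["/extract", "/abc/236466/touchbar.html", "Brenda4", "/abc/256476/mouse.html"],
    "Age": [30, 20, 25, 18],
    "Zodiac": ["Aries", "Leo", "Virgo", "Libra"],
    "Grade": ["A", "AB", "B", "AA"],
    "City": ["Aura", "Somerville", "Hendersonville", "Gannon"],
    "pahun": ["a_b_c", "c_d_e", "f_g", "h_i_j"],
})

# Boolean mask: True where the name contains / + 6 digits + / anywhere.
mask = df["name"].str.contains(r"/\d{6}/")

# Keep the matching rows; reset_index (an assumption, not part of the answer)
# renumbers them from 0 as in the expected output.
result = df[mask].reset_index(drop=True)
print(result)
```

Since str.contains already returns a boolean Series, the == True comparison from the question is not needed.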

Or, to make sure you match only 6-digit chunks and not chunks of 7 or more digits:

df=df[df['name'].str.contains(r'(?<!\d)\d{6}(?!\d)')]

where

  • (?<!\d) - makes sure there is no digit on the left
  • \d{6} - any six digits
  • (?!\d) - no digit on the right is allowed.
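
The effect of the two lookarounds can be checked with the re module directly. This is only an illustrative sketch: apart from the first entry, the sample strings are made up and do not come from the question.

```python
import re

exact_six = re.compile(r"(?<!\d)\d{6}(?!\d)")  # exactly six digits, no digit on either side
plain_six = re.compile(r"\d{6}")               # six digits, possibly part of a longer run

# Hypothetical sample values (only the first one appears in the question).
samples = [
    "/abc/236466/touchbar.html",  # six digits
    "/abc/1234567/long.html",     # seven digits
    "id236466",                   # six digits preceded by letters
    "12345",                      # five digits
]

exact = [bool(exact_six.search(s)) for s in samples]
loose = [bool(plain_six.search(s)) for s in samples]

print(exact)  # [True, False, True, False]
print(loose)  # [True, True, True, False]
```

The seven-digit sample is the one that separates the two patterns: every six-digit window inside it has a digit next to it, so the lookarounds reject it.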

Upvotes: 5

YOLO

Reputation: 21709

You are almost there; use str.contains instead (note that \d{6,} matches runs of six or more digits):

df[df['name'].str.contains(r'\d{6,}')]

Upvotes: 0
