Javier

Reputation: 730

Regex to Skip One Digit and Extract All

Data

  import pandas as pd

  # columns=[...] keeps only 'A' and 'Val1' from the dict
  t = pd.DataFrame({'A': ['3.1 Food', '3.1.1 Bread', '3.1.1.1 Chicken'], 'Val1': [10, 14, 94], 'Val2': [1, 2, 3], 'Val3': [100, 120, 130]},
                   columns=['A', 'Val1'])

                 A  Val1
0         3.1 Food    10
1      3.1.1 Bread    14
2  3.1.1.1 Chicken    94

Expected Output

I'm trying to use a regular expression to extract values into a new column, producing the output below. I'm only interested in values that start with the pattern \d{1}\.\d{1}\.\d{1}, that is, exactly three single digits separated by dots.

A                Val1  SubCategory
3.1 Food         10    nan
3.1.1 Bread      14    3.1.1 Bread
3.1.1.1 Chicken  94    nan

What I've Tried

t['SubCategory'] = t['A'].str.extract(r'^(\d{1}.\d{1}.\d{1}.*)')

A                Val1  SubCategory
3.1 Food         10    nan
3.1.1 Bread      14    3.1.1 Bread
3.1.1.1 Chicken  94    3.1.1.1 Chicken

I'm unable to restrict the regex so that it matches only values of the form 3.1.1 and not deeper levels such as 3.1.1.1. Could someone please enlighten me?

Upvotes: 0

Views: 125

Answers (3)

l'L'l

Reputation: 47169

Anchoring the pattern at the start with ^ and requiring a character other than a dot after the third digit should work:

^((?:\d\.){2}\d)[^.]

Example:

https://regex101.com/r/KucJkp/2
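To check the pattern outside regex101, here is a minimal sketch using Python's standard re module on the three labels from the question. The names pattern, labels, numeric_prefix and results are illustrative only and not part of the answer.

```python
import re

# Pattern from the answer: two "digit + dot" pairs, a third digit,
# then one character that must not be a dot.
pattern = re.compile(r'^((?:\d\.){2}\d)[^.]')

# Labels taken from the question's column 'A'.
labels = ['3.1 Food', '3.1.1 Bread', '3.1.1.1 Chicken']


def numeric_prefix(label):
    """Return the captured 'd.d.d' prefix, or None when the label does not match."""
    m = pattern.match(label)
    return m.group(1) if m else None


results = {label: numeric_prefix(label) for label in labels}
print(results)
# {'3.1 Food': None, '3.1.1 Bread': '3.1.1', '3.1.1.1 Chicken': None}
```

Note that [^.] consumes a character, so a label consisting of just '3.1.1' with nothing after it would not match this pattern.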

Upvotes: 2

marcos

Reputation: 4510

Just add a space delimiter at the end:

import pandas as pd


t=  pd.DataFrame({'A': ['3.1 Food', '3.1.1 Bread', '3.1.1.1 Chicken'], 'Val1': [10, 14, 94], 'Val2': [1,2,3], 'Val3' : [100, 120, 130]},
                      columns=['A', 'Val1'])
t['SubCategory'] = t['A'].str.extract(r'^(\d{1}\.\d{1}\.\d{1})\s')

print(t)

                 A  Val1 SubCategory
0         3.1 Food    10         NaN
1      3.1.1 Bread    14       3.1.1
2  3.1.1.1 Chicken    94         NaN

Upvotes: 2

Amadan

Reputation: 198324

Per my comment: Notice the circumstances of your desired row: there are three numbers separated by dots, and there is a start of line before, and a space after. You got the start anchor in your line of code, but not the end one.

t['SubCategory'] = t['A'].str.extract(r'^(\d{1}\.\d{1}\.\d{1} .*)')

(The dots are escaped here so they match a literal dot rather than any character. If you just wanted to capture the digits in a match, without the space, you would want to use a positive lookahead instead: r'^(\d{1}\.\d{1}\.\d{1})(?= )')
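The difference between consuming the space and only looking ahead for it can be seen in a small sketch. It uses the standard re module rather than pandas so the pattern behaviour is isolated, and it assumes the dots are meant as literal dots. The names full_row, prefix_only and first_group are hypothetical helpers, not part of the answer.

```python
import re

# Variant 1: the space and the rest of the label are inside the capture group.
full_row = re.compile(r'^(\d\.\d\.\d .*)')

# Variant 2: the space is only asserted via a lookahead, so it is not captured.
prefix_only = re.compile(r'^(\d\.\d\.\d)(?= )')

# Labels taken from the question's column 'A'.
labels = ['3.1 Food', '3.1.1 Bread', '3.1.1.1 Chicken']


def first_group(pattern, label):
    """Return capture group 1 if the pattern matches at the start, else None."""
    m = pattern.match(label)
    return m.group(1) if m else None


full_matches = [first_group(full_row, s) for s in labels]
prefix_matches = [first_group(prefix_only, s) for s in labels]

print(full_matches)    # [None, '3.1.1 Bread', None]
print(prefix_matches)  # [None, '3.1.1', None]
```

The first variant reproduces the SubCategory values shown in the question's expected output, while the second yields only the numeric prefix.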

Upvotes: 1
