Reputation: 730
Data
import pandas as pd
t = pd.DataFrame({'A': ['3.1 Food', '3.1.1 Bread', '3.1.1.1 Chicken'], 'Val1': [10, 14, 94], 'Val2': [1, 2, 3], 'Val3': [100, 120, 130]},
                 columns=['A', 'Val1'])
A Val1
0 3.1 Food 10
1 3.1.1 Bread 14
2 3.1.1.1 Chicken 94
Expected Output
I'm trying to use a regular expression to extract values into a new column, as in the output below. I'm only interested in values that start with exactly three dot-separated digits, i.e. the pattern \d{1}.\d{1}.\d{1}
A Val1 SubCategory
3.1 Food 10 nan
3.1.1 Bread 14 3.1.1 Bread
3.1.1.1 Chicken 94 nan
What I've Tried
t['SubCategory'] = t['A'].str.extract(r'^(\d{1}.\d{1}.\d{1}.*)')
A Val1 SubCategory
3.1 Food 10 nan
3.1.1 Bread 14 3.1.1 Bread
3.1.1.1 Chicken 94 3.1.1.1 Chicken
I can't restrict the regex so that it matches only values of the form 3.1.1 and not 3.1.1.1. Could someone point out what I'm missing?
Upvotes: 0
Views: 125
Reputation: 47169
Keep the ^ anchor at the start of the pattern, escape the dots, and require that the character after the third digit is not a dot:
^((?:\d\.){2}\d)[^.]
Example:
https://regex101.com/r/KucJkp/2
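As a quick check of this pattern against the three sample strings from the question, here is a minimal sketch using Python's standard re module. The names pattern, samples and results are only for illustration and are not part of the original answer.

```python
import re

# Two "digit + dot" groups, a third digit, then one character that is not a dot.
pattern = re.compile(r'^((?:\d\.){2}\d)[^.]')

# Sample labels taken from the question's column A.
samples = ['3.1 Food', '3.1.1 Bread', '3.1.1.1 Chicken']

# Keep the captured prefix, or None when the string does not match.
results = []
for s in samples:
    m = pattern.match(s)
    results.append(m.group(1) if m else None)

print(results)
```

Only the second string should match, because the first has too few levels and the third has a dot right after the third digit. The same pattern string can be passed to t['A'].str.extract in pandas. Note that it captures only the numeric prefix, so getting the whole label would need the rest of the string moved inside the capture group.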
Upvotes: 2
Reputation: 4510
Escape the dots and require a whitespace character after the third digit:
import pandas as pd
t= pd.DataFrame({'A': ['3.1 Food', '3.1.1 Bread', '3.1.1.1 Chicken'], 'Val1': [10, 14, 94], 'Val2': [1,2,3], 'Val3' : [100, 120, 130]},
columns=['A', 'Val1'])
t['SubCategory'] = t['A'].str.extract(r'^(\d{1}\.\d{1}\.\d{1})\s')
print(t)
A Val1 SubCategory
0 3.1 Food 10 NaN
1 3.1.1 Bread 14 3.1.1
2 3.1.1.1 Chicken 94 NaN
Upvotes: 2
Reputation: 198324
Per my comment: look at what distinguishes the row you want: three digits separated by dots, with the start of the string before them and a space after them. Your code has the start anchor but nothing that requires the space at the end. Escaping the dots also keeps . from matching any character.
t['SubCategory'] = t['A'].str.extract(r'^(\d\.\d\.\d .*)')
(If you only want to capture the digits, without the space, use a positive lookahead instead: r'^(\d\.\d\.\d)(?= )')
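The two variants described in this answer can be compared side by side. This is a rough sketch assuming a recent pandas version. The Code column is a hypothetical name added only for illustration, and expand=False is used so that str.extract returns a Series.

```python
import pandas as pd

# Sample data from the question, reduced to the two columns that are kept.
t = pd.DataFrame({'A': ['3.1 Food', '3.1.1 Bread', '3.1.1.1 Chicken'],
                  'Val1': [10, 14, 94]})

# Whole label: three dot-separated digits, a space, then the rest of the string.
t['SubCategory'] = t['A'].str.extract(r'^(\d\.\d\.\d .*)', expand=False)

# Digits only: the lookahead requires the space but does not capture it.
t['Code'] = t['A'].str.extract(r'^(\d\.\d\.\d)(?= )', expand=False)

print(t)
```

Rows that do not match either pattern are left as NaN, which is the behaviour the question asks for.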
Upvotes: 1