Reputation: 103
I have a column in a pandas data frame called sample_id. Each entry contains a string, and from this string I'd like to pull a numeric pattern that will have one of two forms:
1-234-5-6789
or
123-4-5648
I'm having trouble defining the correct regex pattern for this. So far I have been experimenting with the following:
re.findall(pattern=r'\b2\w+', string=str(data['sample_id']))
But this only pulls values that start with 2, and only the first chunk of the numeric pattern. How do I express the above patterns, including the dashes?
Upvotes: 1
Views: 3526
Reputation: 163352
You could match an optional part (?:\d-)?
to match 1 digit and a hyphen, followed by \d{3}-\d-\d{4}
which will match the pattern of the digits for both the examples.
(?:\d-)?\d{3}-\d-\d{4}
Instead of using a word boundary \b
, if there cannot be a non-whitespace character before your value, you could prepend the regex with (?<!\S)
and if there cannot be a non-whitespace character after it, you could add (?!\S)
at the end.
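As a rough sketch of how the pattern with both whitespace lookarounds might be applied, the snippet below runs it over a few strings. The sample strings and the helper name extract_id are made up for illustration and are not from the question.

```python
import re

# Optional leading "digit-" followed by 3 digits, 1 digit, and 4 digits.
# The lookarounds require whitespace or a string edge on both sides.
PATTERN = re.compile(r'(?<!\S)(?:\d-)?\d{3}-\d-\d{4}(?!\S)')

# Hypothetical sample_id values, invented for illustration.
samples = [
    'batch 1-234-5-6789 run A',
    'batch 123-4-5648',
    'x123-4-5648',   # preceded by a non-whitespace character, so rejected
    'no id here',
]


def extract_id(text):
    """Return the first matching ID in text, or None if there is none."""
    match = PATTERN.search(text)
    return match.group(0) if match else None


results = [extract_id(s) for s in samples]
print(results)
```

On a pandas column, something along the lines of data['sample_id'].str.extract(r'((?:\d-)?\d{3}-\d-\d{4})', expand=False) should do the same job per row; note that str.extract needs the capturing parentheses around the whole pattern.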
Upvotes: 1
Reputation: 61289
A vertical pipe |
makes an OR in a regular expression, so you can use:
import re
test1='123-4-5648'
test2='1-234-5-6789'
re.findall(pattern=r'[0-9]-[0-9]{3}-[0-9]-[0-9]{4}|[0-9]{3}-[0-9]-[0-9]{4}', string=test1)
re.findall(pattern=r'[0-9]-[0-9]{3}-[0-9]-[0-9]{4}|[0-9]{3}-[0-9]-[0-9]{4}', string=test2)
[0-9]
matches a single digit in the range 0
through 9
(inclusive), {4}
indicates that four such digits should occur in a row, -
means a hyphen, and |
means an OR and separates the two patterns you mention.
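To illustrate, here is a small sketch that compiles the same alternation once and pulls both forms out of a single string. The input text and the variable names are invented for the example.

```python
import re

# Same alternation as above: the four-part form OR the three-part form.
ID_PATTERN = re.compile(
    r'[0-9]-[0-9]{3}-[0-9]-[0-9]{4}|[0-9]{3}-[0-9]-[0-9]{4}'
)

# Hypothetical input containing one ID of each form.
text = 'first 123-4-5648 then 1-234-5-6789'

# The pattern has no capture groups, so findall returns the full matches.
found = ID_PATTERN.findall(text)
print(found)
```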
Upvotes: 1
Reputation: 166
If there will only ever be at most one hyphen between two numbers, then ^[0-9]+(-[0-9]+)+$
would work well. It uses the normal*(special normal*)*
pattern where normal
is [0-9]
and special
is -
.
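A quick sketch of what this looser pattern accepts, assuming each candidate is a whole string checked on its own (the candidate values are invented). It does not fix the group lengths, so shapes other than the two in the question also pass.

```python
import re

# One or more digits, then one or more "-digits" groups, anchored at both ends.
LOOSE = re.compile(r'^[0-9]+(-[0-9]+)+$')

# Hypothetical candidates, invented for illustration.
candidates = [
    '1-234-5-6789',   # first form from the question
    '123-4-5648',     # second form from the question
    '12-34',          # different shape, still accepted
    '123--4',         # double hyphen, rejected
    '1234',           # no hyphen at all, rejected
    'id 123-4-5648',  # extra text before the number, rejected by ^
]

accepted = [c for c in candidates if LOOSE.match(c)]
print(accepted)
```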
Upvotes: 0