Reputation: 1713
I am trying to extract a specific pattern of numbers using regular expression in Python 3.7. Below are the 4 possible patterns.
Pattern 1 - The length of this pattern is exactly 10 and cannot start with a zero. These consist of only integers. Ex: '1234567890'
Pattern 2 - The length of this pattern is exactly 11 and can start with a zero. These consist of only integers. Ex: '01234567890'
Pattern 3 - The length of this pattern is exactly 11 and cannot start with a zero. There is one space after the 5th number and all other characters are numbers. Ex: '12345 67890'
Pattern 4 - The length of this pattern is exactly 12 and can start with a zero. There is one space after the 6th number and all other characters are numbers. Ex: '012345 67890'
Note - The examples provided are for representation only. The actual set of numbers in my string can be anything. Example: '2345653340' or '034945 85730' or '000000 00000' or '09876543210'.
Below is what I have tried. For some reason, it does not return the desired results. How do I go about this?
import re
regex = re.compile(r"(\d)?\d\d\d\d\d(\b)?\d\d\d\d\d")
number1 = regex.findall("number is 1234567890") # For Pattern 1 expected output is '1234567890'
number2 = regex.findall("number is 01234567890") # For Pattern 2 expected output is '01234567890'
number3 = regex.findall("number is 12345 67890") # For Pattern 3 expected output is '12345 67890'
number4 = regex.findall("number is 012345 67890") # For Pattern 4 expected output is '012345 67890'
Upvotes: 0
Views: 1038
Reputation: 2005
Of all the regexes given so far, this one seems the easiest to write and the fastest to run:
from re import compile
regex = compile(r'\d{11}|[1-9]\d{9}|[1-9]\d{4}\s\d{5}|\d{6}\s\d{5}')
number1 = regex.findall("number is 1234567890")
number2 = regex.findall("number is 01234567890")
number3 = regex.findall("number is 12345 67890")
number4 = regex.findall("number is 012345 67890")
You get the expected results (findall returns a list of matches):
>>> number1
['1234567890']
>>> number2
['01234567890']
>>> number3
['12345 67890']
>>> number4
['012345 67890']
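If a bare string is wanted instead of a list, one option is to use search() and take group() from the match object. A minimal sketch, assuming only the first number in each string matters; the helper name first_number is made up for illustration:

```python
import re

# Same alternation as above.
regex = re.compile(r'\d{11}|[1-9]\d{9}|[1-9]\d{4}\s\d{5}|\d{6}\s\d{5}')

def first_number(text):
    """Return the first matching number as a string, or None if there is none.

    Hypothetical helper, not part of the re module.
    """
    match = regex.search(text)
    return match.group() if match else None

print(first_number("number is 1234567890"))
print(first_number("number is 012345 67890"))
print(first_number("no number here"))
```

search() stops at the first match, so a string containing several numbers would still need findall() or finditer().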
Answer from Andrej Kesely takes 80 steps (regex101.com).
Answer from The fourth bird takes 44 steps (regex101.com).
My answer takes 41 steps (regex101.com).
Upvotes: 1
Reputation: 163207
You could use an alternation to match the different requirements. You could use a word boundary \b to prevent the number being part of a larger word.
\b(?:\d{6} \d{5}|[1-9]\d{4} \d{5}|[1-9]\d{9}|\d{11})\b
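As a rough sketch of how the pattern above could be used from Python: the sample strings are the ones from the question, plus one 12-digit string added here as an assumed non-match to show the effect of the word boundaries. The name NUMBER_RE is illustrative.

```python
import re

# The group is non-capturing, so findall() returns the whole matched text
# rather than the contents of a group.
NUMBER_RE = re.compile(r'\b(?:\d{6} \d{5}|[1-9]\d{4} \d{5}|[1-9]\d{9}|\d{11})\b')

samples = [
    "number is 1234567890",    # Pattern 1
    "number is 01234567890",   # Pattern 2
    "number is 12345 67890",   # Pattern 3
    "number is 012345 67890",  # Pattern 4
    "id 123456789012",         # 12 digits in a row: added here as a non-match
]

results = [NUMBER_RE.findall(s) for s in samples]
for sample, found in zip(samples, results):
    print(sample, '->', found)
```

The last sample gives an empty list because the trailing \b cannot match between two digits, so a 10- or 11-digit prefix of a longer run is rejected.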
\b    Word boundary
(?:    Non-capturing group
\d{6} \d{5}    Pattern 4: 6 times 0-9, a space, 5 times 0-9
|    Or
[1-9]\d{4} \d{5}    Pattern 3: 1 time 1-9, 4 times 0-9, a space, 5 times 0-9
|    Or
[1-9]\d{9}    Pattern 1: 1 time 1-9, 9 times 0-9
|    Or
\d{11}    Pattern 2: 11 times 0-9
)    Close group
\b    Word boundary
Upvotes: 1
Reputation: 195408
Regex101 (link):
import re
l = ["number is 1234567890",
"number is 01234567890",
"number is 12345 67890",
"number is 012345 67890",
"number is 912345 67890 - dont match",
"number is 02345 67890 - dont match",
"number is 91234567890 - dont match",
"number is 0234567890 - dont match"]
for s in l:
    m = re.findall(r'\b0\d{5}\s\d{5}\b|\b[1-9]\d{4}\s\d{5}\b|\b0\d{10}\b|\b[1-9]\d{9}\b', s)
    print(m)
Prints:
['1234567890']
['01234567890']
['12345 67890']
['012345 67890']
[]
[]
[]
[]
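As for why the attempt in the question falls short, two things seem to be at play: findall() returns the contents of the capture groups when a pattern has any, and \b is a zero-width assertion, so it never consumes the space. A small sketch of both effects follows. The variable names and the pattern in `loosened` are illustrative only; it repairs just these two mechanical problems and does not enforce the leading-digit rules.

```python
import re

# The pattern from the question, unchanged.
attempt = re.compile(r"(\d)?\d\d\d\d\d(\b)?\d\d\d\d\d")

# Two capture groups: findall() yields one tuple of group contents per match,
# not the matched text itself.
no_space = attempt.findall("number is 1234567890")
print(no_space)

# \b matches a position, not a character, so the space is never consumed
# and the following \d fails on it.
with_space = attempt.findall("number is 12345 67890")
print(with_space)

# Illustrative repair: no capture groups, and an optional literal space.
loosened = re.compile(r"\d?\d{5} ?\d{5}")
print(loosened.findall("number is 12345 67890"))
print(loosened.findall("number is 012345 67890"))
```

The alternation above is still needed to apply the rules about which patterns may start with a zero.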
Upvotes: 1