Reputation: 632
How can we write a regular expression to extract years from text? Years may appear in the following forms:
Case 1:
1970 - 1980 --> 1970, 1980
January 1920 - Feb 1930 --> 1920, 1930
May 1920 to September 1930 --> 1920, 1930
Case 2:
July 1945 --> 1945
Writing a regular expression for Case 1 is easy, but how can I handle Case 2 along with it?
\d{4} \s? (?: [^a-zA-Z0-9] | to) \s? \w+? \d{4}
Upvotes: 0
Views: 116
Reputation: 342
For your requirement, just match all 4-digit numbers:
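import re

s = '''1970 - 1980
January 1920 - Feb 1930
May 1920 to September 1930
July 1945'''

# \b\d{4}\b matches any standalone run of exactly four digits
p = re.compile(r'\b\d{4}\b')

# Process the sample text one line at a time
for x in s.splitlines():
    result = p.findall(x)
    print(result)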
Output:
['1970', '1980']
['1920', '1930']
['1920', '1930']
['1945']
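Note that \b\d{4}\b accepts any standalone four-digit number, not only years. If that is a concern, a tighter pattern could be sketched as below. The 1000-2099 range and the names YEAR_RE and find_years are assumptions chosen for illustration, not part of the original answer; adjust the range to your data.

```python
import re

# Assumption: a "year" lies between 1000 and 2099.
YEAR_RE = re.compile(r'\b(?:1[0-9]{3}|20[0-9]{2})\b')

def find_years(text):
    """Return every plausible year in text as an int, in order of appearance."""
    return [int(y) for y in YEAR_RE.findall(text)]

print(find_years('May 1920 to September 1930'))   # [1920, 1930]
print(find_years('Invoice 4021 from July 1945'))  # [1945]
```

The second call shows that 4021 is skipped because it falls outside the assumed range.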
Upvotes: 2
Reputation: 3405
Regex: .*?([0-9]{4})(?:.*?([0-9]{4}))?
or .*?(\d{4})(?:.*?(\d{4}))?
Details:
() Capturing group
(?:) Non-capturing group
{n} Matches exactly n times
.*? Matches any character between zero and unlimited times (lazy)
Python code:
import re

def Years(text):
    # Each match is a tuple: (first year, second year if present)
    return re.findall(r'.*?([0-9]{4})(?:.*?([0-9]{4}))?', text)

print(Years('January 1920 - Feb 1930'))
Output:
[('1920', '1930')]
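For Case 2 the second group does not take part in the match, so re.findall fills it with an empty string. A minimal sketch of flattening the tuples and dropping those empty entries follows; PAIR_RE and years_flat are hypothetical names used only for this example.

```python
import re

# Same pattern as above, compiled once.
PAIR_RE = re.compile(r'.*?([0-9]{4})(?:.*?([0-9]{4}))?')

def years_flat(text):
    """Flatten the (first, second) tuples and drop empty second groups."""
    return [year for pair in PAIR_RE.findall(text) for year in pair if year]

print(PAIR_RE.findall('July 1945'))           # [('1945', '')]
print(years_flat('July 1945'))                # ['1945']
print(years_flat('January 1920 - Feb 1930'))  # ['1920', '1930']
```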
Upvotes: 0