Reputation: 19
I have thousands of datasets from which I want to extract the year that is preceded by a month name. For example:
In dataset 1: September 1980
In dataset 2: October, 1978
The regular expression that I wrote using https://regex101.com/:
^(?<month>)\w+(\1)\s[0-9]{4}$|(^(?<fmonth>)\w+,\s[0-9]{4}$)
It does the job on regex101. However, when I tried to use it in my Python code, I got the error below:
File "<ipython-input-216-a995358d0957>", line 1, in <module>
runfile('C:/Users/Muntabir/nltk_data/corpora/cookbook/clean_data/text-classification_year(clean).py', wdir='C:/Users/Muntabir/nltk_data/corpora/cookbook/clean_data')
File "C:\Users\Muntabir\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
execfile(filename, namespace)
File "C:\Users\Muntabir\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/Muntabir/nltk_data/corpora/cookbook/clean_data/text-classification_year(clean).py", line 76, in <module>
year_data = re.findall('^(?<month>)\w+(\1)\s[0-9]{4}$|(^(?<fmonth>)\w+,\s[0-9]{4}$)', tokenized_string)
File "C:\Users\Muntabir\Anaconda3\lib\re.py", line 222, in findall
return _compile(pattern, flags).findall(string)
File "C:\Users\Muntabir\Anaconda3\lib\re.py", line 301, in _compile
p = sre_compile.compile(pattern, flags)
File "C:\Users\Muntabir\Anaconda3\lib\sre_compile.py", line 562, in compile
p = sre_parse.parse(p, flags)
File "C:\Users\Muntabir\Anaconda3\lib\sre_parse.py", line 855, in parse
p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
File "C:\Users\Muntabir\Anaconda3\lib\sre_parse.py", line 416, in _parse_sub
not nested and not items))
File "C:\Users\Muntabir\Anaconda3\lib\sre_parse.py", line 691, in _parse
len(char) + 2)
error: unknown extension ?<m
I am not sure why it causes this error. Can anyone explain it and suggest a possible solution? Your help would be much appreciated.
Thanks
Upvotes: 0
Views: 142
Reputation: 19
I really appreciate all of your contributions. @Joan Lara Ganau's solution gave me a guideline for what the regexp could look like. However, @Joan, your regexp also matches a year that is preceded by a day and a month, and it does not specifically look for a comma followed by a space. As I mentioned, I have thousands of datasets from which I want to extract exactly the year that is preceded by a month. I was looking for the following formats:
a.) Month Year
b.) Month, Year
Anyway, I found a solution to my problem after a number of experiments:
import re

# Every fragment is a raw string so that \s and \d reach the regex engine unchanged.
year_result = re.compile(
    r"(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|"
    r"Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|"
    r"Dec(ember)?)(,?)(\s\d{4})")
Also, match() and search() return None if the pattern does not match. In that case, calling group() raises an AttributeError, something like: 'NoneType' object has no attribute 'group'. So I handled it in the following manner:
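def matched(document):
    # search() scans the whole string; match() would only try the very start.
    year = year_result.search(document)
    if year is None:
        # No "Month Year" or "Month, Year" was found.
        return '0'
    # Group 14 is (\s\d{4}), so the result keeps the leading whitespace, e.g. ' 1980'.
    return year.group(14)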
Now you can pass the text document from which you want to extract the year to the above function.
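For what it's worth, a variation on the function above can avoid counting groups by hand. This is only a sketch under a few assumptions: the names YEAR_PATTERN and extract_year are hypothetical, the month alternatives are made non-capturing, the year gets a named group, and word boundaries are added. Unlike group(14), the result has no leading whitespace.

```python
import re

# Non-capturing groups (?:...) keep the month alternatives from being numbered,
# so the year can be read by name instead of by position.
# YEAR_PATTERN and extract_year are hypothetical names used for illustration.
YEAR_PATTERN = re.compile(
    r"\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
    r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|"
    r"Dec(?:ember)?),?\s(?P<year>\d{4})\b"
)


def extract_year(document):
    """Return the four-digit year that follows a month name, or '0' if none is found."""
    found = YEAR_PATTERN.search(document)
    if found is None:
        return '0'
    return found.group('year')


print(extract_year('Published in September 1980 by the committee'))
print(extract_year('October, 1978'))
print(extract_year('no date here'))
```

Keeping the '0' fallback means callers written against the original function should not need to change.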
Thanks
Upvotes: 1
Reputation: 91385
In Python, a named capture group is written: (?P<name>...)
not: (?<name>...)
Use: ^(?P<month>\w+),?\s[0-9]{4}$
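As a minimal sketch of the difference, the snippet below compiles the pattern with Python's (?P<name>...) spelling and then shows that the (?<name>...) spelling is rejected at compile time. The extra named group year and the two sample strings are assumptions added for illustration; the pattern above only names the month.

```python
import re

# Python's re module spells named groups as (?P<name>...).
# The `year` group is an illustrative addition, not part of the pattern above.
pattern = re.compile(r'^(?P<month>\w+),?\s(?P<year>[0-9]{4})$')

for text in ('September 1980', 'October, 1978'):
    m = pattern.match(text)
    if m is not None:
        print(m.group('month'), m.group('year'))

# The (?<name>...) spelling fails when the pattern is compiled.
try:
    re.compile(r'^(?<month>\w+),?\s[0-9]{4}$')
except re.error as exc:
    print('re.error:', exc)
```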
Upvotes: 0
Reputation: 1396
import re
year = re.compile(r'(\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\D?\D?(\d{1,4})')
print(year.match('September 1980').group(3))
print(year.match('October, 1978').group(3))
Output:
1980
1978
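Since the question calls re.findall, here is a small sketch of how the same pattern behaves on a longer string. With several capture groups, findall returns one tuple per match, so finditer with group(3) may be the more convenient way to collect only the years. The sample sentence is made up for illustration.

```python
import re

# Same pattern as above; group 3 holds the year.
year = re.compile(r'(\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\D?\D?(\d{1,4})')

# Hypothetical sample text containing both date formats.
text = 'Filed September 1980; revised October, 1978.'

# findall yields a tuple of all three groups for every match.
print(year.findall(text))

# finditer yields match objects, so one group can be picked per match.
years = [m.group(3) for m in year.finditer(text)]
print(years)
```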
Upvotes: 0