Muntabir Choudhury

Reputation: 19

Extracting a year preceded by a month using regex in Python

I have thousands of datasets from which I want to extract a year that is preceded by a month. For example:

In dataset 1: September 1980

In dataset 2: October, 1978

The regular expression that I wrote using https://regex101.com/:

^(?<month>)\w+(\1)\s[0-9]{4}$|(^(?<fmonth>)\w+,\s[0-9]{4}$)

It works on regex101. However, when I used it in my Python code, I got the following error:

  File "<ipython-input-216-a995358d0957>", line 1, in <module>
    runfile('C:/Users/Muntabir/nltk_data/corpora/cookbook/clean_data/text-classification_year(clean).py', wdir='C:/Users/Muntabir/nltk_data/corpora/cookbook/clean_data')
  File "C:\Users\Muntabir\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)
  File "C:\Users\Muntabir\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/Muntabir/nltk_data/corpora/cookbook/clean_data/text-classification_year(clean).py", line 76, in <module>
    year_data = re.findall('^(?<month>)\w+(\1)\s[0-9]{4}$|(^(?<fmonth>)\w+,\s[0-9]{4}$)', tokenized_string)
  File "C:\Users\Muntabir\Anaconda3\lib\re.py", line 222, in findall
    return _compile(pattern, flags).findall(string)
  File "C:\Users\Muntabir\Anaconda3\lib\re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Users\Muntabir\Anaconda3\lib\sre_compile.py", line 562, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Users\Muntabir\Anaconda3\lib\sre_parse.py", line 855, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\Users\Muntabir\Anaconda3\lib\sre_parse.py", line 416, in _parse_sub
    not nested and not items))
  File "C:\Users\Muntabir\Anaconda3\lib\sre_parse.py", line 691, in _parse
    len(char) + 2)
error: unknown extension ?<m

I am not sure why it causes this error. Can anyone explain it and suggest a possible solution? Your help would be much appreciated.

Thanks

Upvotes: 0

Views: 142

Answers (3)

Muntabir Choudhury

Reputation: 19

I really appreciate all of your contributions. @Joan Lara Ganau's solution gave me a guideline for what the regexp could be. However, @Joan, your regexp also matches a year preceded by a day and a month, and it does not specifically look for a comma and a space. As I mentioned, I have thousands of datasets from which I want to extract exactly a year that is preceded by a month. I was looking for the following formats:

a.) Month Year b.) Month, Year

Anyway, I found a solution to my problem after a number of experiments. The solution is:

import re

# Every line is a raw string so that \s and \d reach the regex engine unescaped
year_result = re.compile(
                    r"(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|"
                    r"Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|"
                    r"Dec(ember)?)(,?)(\s\d{4})")

Note that match() (like search()) returns None when the pattern does not match. Calling group() on that result raises AttributeError: 'NoneType' object has no attribute 'group'. So I fixed it in the following manner:

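def matched(document):
    # search() scans the whole document; match() would only try the start
    year = year_result.search(document)
    if year is None:
        return '0'
    # group 14 is (\s\d{4}), so the result keeps its leading whitespace
    return year.group(14)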

Now you can pass the text document from which you want to extract the year to the above function.
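As a possible variation on the pattern above, here is a minimal sketch that uses non-capturing groups so the year does not have to be looked up by position (group 14). The names MONTH_THEN_YEAR and find_year and the group name year are made up for illustration, and the sketch assumes the year is always four digits and separated from the month by a single whitespace character.

```python
import re

# Non-capturing (?:...) groups keep the month alternatives from shifting the
# group numbering. The group name `year` is a hypothetical choice.
MONTH_THEN_YEAR = re.compile(
    r"\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
    r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|"
    r"Dec(?:ember)?),?\s(?P<year>\d{4})\b"
)

def find_year(document, default="0"):
    """Return the first year that directly follows a month name, else default."""
    m = MONTH_THEN_YEAR.search(document)
    return m.group("year") if m else default

print(find_year("September 1980"))
print(find_year("Published in October, 1978 by the author"))
print(find_year("no date here"))
```

Because the year is captured without the whitespace, the returned string needs no stripping. A date such as "May 12, 1999" is deliberately not matched here, since a day sits between the month and the year.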

Thanks

Upvotes: 1

Toto

Reputation: 91385

In Python's re module, a named capture group is written (?P<name>...), not (?<name>...).

Use: ^(?P<month>\w+),?\s[0-9]{4}$
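To show how this might look in code, here is a small sketch built on the pattern above. The second named group, year, and the helper name extract_year are additions for illustration only; the pattern in the answer captures just the month.

```python
import re

# Python spells named groups (?P<name>...). The `year` group is an addition
# for illustration and is not part of the pattern suggested above.
MONTH_YEAR = re.compile(r"^(?P<month>\w+),?\s(?P<year>[0-9]{4})$")

def extract_year(text):
    """Return the four-digit year as a string, or None if text does not match."""
    m = MONTH_YEAR.match(text)
    return m.group("year") if m else None

print(extract_year("September 1980"))
print(extract_year("October, 1978"))
print(extract_year("1980"))
```

Note that \w+ accepts any word, not only month names, so this assumes each input string holds nothing but a month and a year.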


Upvotes: 0

Joan Lara

Reputation: 1396

import re

year = re.compile(r'(\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\D?\D?(\d{1,4})')
print(year.match('September 1980').group(3))
print(year.match('October, 1978').group(3))

Output:

1980
1978

Upvotes: 0
