Ananthu

Reputation: 159

Regex: can't reproduce in a Jupyter notebook the output I get on regex101

My regex is matching strings I don't expect. My aim is to extract only dates in one specific format (month in letters followed by the year, e.g. Mar 2009), but the expression also matches and captures part of other formats, such as 20 March 2009. The input is as follows.

import pandas as pd

df5 = pd.Series(["04/20/2009", "04/20/09", "4/20/09", "4/3/09", "Mar-20-2009", "Mar 20, 2009", "March 20, 2009", "Mar. 20, 2009", "Mar 20 2009", "20 Mar 2009", "20 March 2009", "20 Mar. 2009", "20 March, 2009", "Mar 20th, 2009", "Mar 21st, 2009", "Mar 22nd, 2009", "Feb 2009", "Sep 2009", "Oct 2010", "6/2008", "12/2009", "2009", "2010"])

The regex I used was:

df5.str.extractall(r'(?P<date>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z., -]*\d{4})')

I then rechecked my expression on the regex101 website and changed it to the following:

[^ ](?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z., -]*\d{4}

But the changed expression does not match any values in the Series, whereas on regex101 the same expression gives me the output I need. Where am I going wrong?

Upvotes: 0

Views: 110

Answers (1)

Levi

Reputation: 7343

Your first expression was fine; you just need to anchor it with the start-of-string character ^ so that it cannot match in the middle of a string such as 20 March 2009:

df5.str.extractall(r'^(?P<date>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z., \-]*\d{4})')

16 0      Feb 2009
17 0      Sep 2009
18 0      Oct 2010
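
As for why the changed expression worked on regex101 but not in pandas, a likely explanation (assuming the test text on regex101 had one date per line) is that the leading [^ ] must consume one non-space character before the month name. In a multi-line text the newline at the end of the previous line satisfies it, but pandas applies the pattern to each string separately, so nothing precedes the month. A minimal sketch with the standard re module, using a shortened, illustrative list of dates:

```python
import re

MONTHS = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
# The "changed" expression from the question.
changed = re.compile(r"[^ ]" + MONTHS + r"[a-z., -]*\d{4}")

# Shortened sample of the input (illustrative subset).
dates = ["20 Mar 2009", "Feb 2009", "Sep 2009"]

# One multi-line text, as presumably pasted into regex101:
# the newline before "Feb" and "Sep" satisfies [^ ].
blob = "\n".join(dates)
matches_in_blob = changed.findall(blob)

# Separate strings, as pandas sees them: no character precedes the month.
matches_per_string = [changed.findall(s) for s in dates]

print(matches_in_blob)     # ['\nFeb 2009', '\nSep 2009']
print(matches_per_string)  # [[], [], []]
```

Note that the matches in the multi-line text include the newline itself, which is another hint that [^ ] is consuming a character rather than acting as an anchor.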

Edit:

You might also want to make the space mandatory; to do that, move it out of the square brackets:

df5.str.extractall(r'^(?P<date>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z.,\-]* \d{4})')
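
To see what the mandatory space changes, here is a small comparison of the two patterns using the standard re module. The strings "Feb2009" and "Feb.2009" are made-up samples that are not in the original data; they are only there to show the difference.

```python
import re

MONTHS = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
# Space inside the character class: optional.
optional_space = re.compile(r"^(?P<date>" + MONTHS + r"[a-z., \-]*\d{4})")
# Space outside the character class: exactly one space is required before the year.
mandatory_space = re.compile(r"^(?P<date>" + MONTHS + r"[a-z.,\-]* \d{4})")

# "Feb2009" and "Feb.2009" are hypothetical inputs, not from the question.
samples = ["Feb 2009", "February 2009", "Feb2009", "Feb.2009", "20 Mar 2009"]


def extract(pattern, text):
    """Return the captured date, or None if the pattern does not match."""
    m = pattern.search(text)
    return m.group("date") if m else None


results = {s: (extract(optional_space, s), extract(mandatory_space, s)) for s in samples}

for text, (opt, mand) in results.items():
    print(f"{text!r:18} optional={opt!r:18} mandatory={mand!r}")
```

Both patterns accept "Feb 2009" and "February 2009" and reject "20 Mar 2009"; only the first one accepts the forms without a space.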

Upvotes: 1
