Reputation: 51
I have a column with dates in different formats, such as:
2\06\1998
21.11.1998
18-02-2001
03/05/1999
20 july 1999
I only want the year.
I tried different regexes, such as:
def get_date(date):
    number= re.findall('\[0-9]\-{0,1}\\{0,1}\/{0,1}\[0-9]\-{0,1}\\{0,1}\/{0,1}\[0-9]', date)
    return number[6:]
but I can't extract the year. What's the most suitable regex for this case? It's not a problem to use two regexes, one for the dd/mm/yyyy format and one for dates with the month written in letters.
Upvotes: 2
Views: 3358
Reputation: 81594
Don't fall into the regex/'strip the last 4 characters' rabbit-hole.
If a date in another format arrives (for example 2019-08-27), any naive regex/stripping solution will break.
Use pd.to_datetime to let pandas deal with the parsing, then just grab dt.year.
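import pandas as pd

df = pd.DataFrame({'a': ['2/06/1998', '21.11.1998', '18-02-2001', '03/05/1999',
                         '20 july 1999', '2019-08-27']})
# On pandas 2.0 and later, a column with mixed formats may need format='mixed' here.
df['a'] = pd.to_datetime(df['a'])
print(df['a'].dt.year)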
Outputs
0 1998
1 1998
2 2001
3 1999
4 1999
5 2019
Note: I had to change the direction of the slashes (2\06\1998 to 2/06/1998), but that's a very small price to pay for a far more robust solution.
Upvotes: 3
Reputation: 7812
I would use a simple \d{4} regex.
import re
s = """2\\06\\1998
21.11.1998
18-02-2001
03/05/1999
20 july 1999"""
for date in s.splitlines():
    year = re.search(r"\d{4}", date).group(0)
    print(year)
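A slightly more defensive variant of the same idea, as a minimal sketch: it returns None when no year is found instead of raising, and it refuses four digits that are part of a longer digit run. The helper name extract_year and the "n/a" sample are assumptions for illustration, not part of the original answer.

```python
import re

# Four digits that are not embedded in a longer run of digits.
YEAR_RE = re.compile(r"(?<!\d)\d{4}(?!\d)")

def extract_year(text):
    """Return the first standalone 4-digit number in text as an int, or None."""
    match = YEAR_RE.search(text)
    return int(match.group(0)) if match else None

samples = ["2\\06\\1998", "21.11.1998", "18-02-2001",
           "03/05/1999", "20 july 1999", "n/a"]
years = [extract_year(s) for s in samples]
print(years)  # [1998, 1998, 2001, 1999, 1999, None]
```

Like the plain \d{4} search, this still trusts that any standalone four-digit number is the year.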
Upvotes: 2
Reputation: 163217
You could use 2 capturing groups: the first group captures the divider, and a backreference \1 then requires the same divider the second time, so the separators have to be consistent.
The year part is captured in the second group.
^\d+([\\/. -])(?:\d+|[a-z]+)\1(\d{4})$
Instead of using the anchors ^ and $, you could also use lookarounds:
(?<!\S)\d+([\\/. -])(?:\d+|[a-z]+)\1(\d{4})(?!\S)
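As a rough sketch, the lookaround version could be applied from Python like this. The helper name year_from_date and the re.IGNORECASE flag (so that "July" also matches) are assumptions added for illustration.

```python
import re

# Lookaround version of the pattern; group 1 is the divider, group 2 the year.
DATE_RE = re.compile(
    r"(?<!\S)\d+([\\/. -])(?:\d+|[a-z]+)\1(\d{4})(?!\S)",
    re.IGNORECASE,  # assumption: accept "July" as well as "july"
)

def year_from_date(text):
    """Return the year as a string, or None if no date-like token is found."""
    match = DATE_RE.search(text)
    return match.group(2) if match else None

samples = ["2\\06\\1998", "21.11.1998", "18-02-2001", "03/05/1999",
           "20 july 1999", "born on 20 July 1999 in Paris", "2019-08-27"]
results = [year_from_date(s) for s in samples]
print(results)
# ['1998', '1998', '2001', '1999', '1999', '1999', None]
```

The last sample returns None because the pattern expects the 4-digit year at the end, so a year-first date such as 2019-08-27 is not covered.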
Pattern parts
(?<!\S)  Assert what is on the left is not a non-whitespace char
\d+  Match 1+ digits
([\\/. -])  Capture group 1, match any of the listed
(?:  Non-capturing group
    \d+  Match 1+ digits
    |  or
    [a-z]+  Match 1+ lowercase chars
)  Close non-capturing group
\1  Backreference to what is captured in group 1
(\d{4})  Capture group 2, match 4 digits for the year
(?!\S)  Assert what is on the right is not a non-whitespace char
Upvotes: 0
Reputation: 1233
Pandas to_datetime is surprisingly good at recognising different date formats. The only problem it will have is with backslashes, but if you can replace them with a plain string replacement first, I think that's easier than using a regex.
import pandas as pd
df = pd.DataFrame({"date": ["2\\06\\1998", "21.11.1998", "18-02-2001", "03/05/1999", "20 july 1999"]})
df["date"] = df["date"].str.replace("\\", "/", regex=False)  # literal backslash, not a regex
df["date"] = pd.to_datetime(df["date"])
df["date"].dt.year
0 1998
1 1998
2 2001
3 1999
4 1999
Name: date, dtype: int64
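If pandas is not available, the same "normalise the separators, then parse" idea can be sketched with the standard library. This swaps pd.to_datetime for datetime.strptime with an explicit list of formats; the helper name parse_year and the format list are assumptions, and the month-name format assumes an English locale.

```python
from datetime import datetime

# Assumed formats, tried in order after separators are normalised to "/".
FORMATS = ("%d/%m/%Y", "%d %B %Y", "%Y/%m/%d")

def parse_year(text):
    """Parse text as a date and return its year, or None if no format fits."""
    normalised = text.strip()
    for sep in ("\\", ".", "-"):
        normalised = normalised.replace(sep, "/")
    for fmt in FORMATS:
        try:
            return datetime.strptime(normalised, fmt).year
        except ValueError:
            continue  # try the next format
    return None

samples = ["2\\06\\1998", "21.11.1998", "18-02-2001",
           "03/05/1999", "20 july 1999", "2019-08-27"]
parsed = [parse_year(s) for s in samples]
print(parsed)  # [1998, 1998, 2001, 1999, 1999, 2019]
```

Unlike pandas, this only accepts the formats listed, so an unexpected format yields None rather than a guess.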
Upvotes: 0