Ivan Mera

Reputation: 51

How to extract the year from different date formats in Python

I have a column with dates in different formats, such as:

2\06\1998
21.11.1998  
18-02-2001
03/05/1999 
20 july 1999

I only want the year.

I tried several regexes, such as:

def get_date(date):
        number= re.findall('\[0-9]\-{0,1}\\{0,1}\/{0,1}\[0-9]\-{0,1}\\{0,1}\/{0,1}\[0-9]', date)
        return number[6:]

but I can't extract the year. What's the most suitable regex for this case? I don't mind using two regexes, one for the dd/mm/yyyy format and one for dates with the month written out.

Upvotes: 2

Views: 3358

Answers (4)

DeepSpace

Reputation: 81594

Don't fall into the regex/'strip the last 4 characters' rabbit-hole. If a date in another format arrives (for example 2019-08-27) any naive regex/stripping solution will break.

Use pd.to_datetime to let pandas deal with the parsing, then just grab dt.year.

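import pandas as pd

df = pd.DataFrame({'a': ['2/06/1998', '21.11.1998', '18-02-2001', '03/05/1999',
                         '20 july 1999', '2019-08-27']})
# On pandas 2.0+, mixed formats may need pd.to_datetime(df['a'], format='mixed').
df['a'] = pd.to_datetime(df['a'])
print(df['a'].dt.year)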

Outputs

0    1998
1    1998
2    2001
3    1999
4    1999
5    2019

Note: I had to change the direction of the slashes (2\06\1998 to 2/06/1998), but that is a very small price to pay for a far more robust solution.
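
If pandas is not available, or you want the accepted formats to be explicit, the same "parse the date, then take the year" idea can be sketched with the standard library. This is only a sketch: parse_year and the FORMATS tuple are hypothetical names, the list of formats is an assumption based on the samples in the question (day first), and %B relies on English month names in the current locale.

```python
from datetime import datetime

# Assumed formats (day first); extend the tuple if other layouts turn up.
FORMATS = ("%d/%m/%Y", "%d.%m.%Y", "%d-%m-%Y", "%d %B %Y", "%Y-%m-%d")

def parse_year(text):
    """Return the year as an int, or None if no assumed format matches."""
    # Trim stray whitespace and treat backslashes as forward slashes.
    cleaned = text.strip().replace("\\", "/")
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).year
        except ValueError:
            continue  # this format did not fit, try the next one
    return None

dates = ["2\\06\\1998", "21.11.1998  ", "18-02-2001",
         "03/05/1999 ", "20 july 1999", "2019-08-27"]
print([parse_year(d) for d in dates])
```

Unlike a regex, this rejects strings that are not valid dates at all, at the cost of having to list every format you expect.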

Upvotes: 3

Olvin Roght

Reputation: 7812

I would use a simple \d{4} regex.

import re

s = """2\\06\\1998
21.11.1998  
18-02-2001
03/05/1999 
20 july 1999"""
for date in s.splitlines():
    year = re.search(r"\d{4}", date).group(0)
    print(year)
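
As a possible refinement of the same idea, the search can be wrapped in a small helper that returns None instead of raising when a row has no four-digit run. The names YEAR_RE and extract_year are hypothetical, and the lookarounds are an addition of this sketch (they keep the pattern from matching inside a longer run of digits).

```python
import re

# Exactly four digits, not preceded or followed by another digit.
YEAR_RE = re.compile(r"(?<!\d)\d{4}(?!\d)")

def extract_year(text):
    """Return the first standalone 4-digit number as an int, else None."""
    match = YEAR_RE.search(text)
    return int(match.group(0)) if match else None

for date in ["2\\06\\1998", "21.11.1998  ", "20 july 1999", "n/a"]:
    print(repr(date), extract_year(date))
```

This still assumes the year is the only four-digit number in the string; it does not check that the rest of the text is a date.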

Upvotes: 2

The fourth bird

Reputation: 163217

You could use two capturing groups. The first group captures the divider, and the backreference \1 then requires the second divider to be the same one.

The year part is captured in the second group.

^\d+([\\/. -])(?:\d+|[a-z]+)\1(\d{4})$


Instead of using the anchors ^ and $, you could also use lookarounds

(?<!\S)\d+([\\/. -])(?:\d+|[a-z]+)\1(\d{4})(?!\S)


Pattern parts

  • (?<!\S) Assert what is on the left is not a non whitespace char
  • \d+ Match 1+ digits
  • ([\\/. -]) Capture group 1, match any of the listed
  • (?: Non capturing group
    • \d+ Match 1+ digits
    • | or
    • [a-z]+ Match 1+ lowercase chars
  • ) Close non-capturing group
  • \1 Backreference to what is captured in group 1
  • (\d{4}) Capture group 2, match 4 digits for the year
  • (?!\S) Assert what is on the right is not a non whitespace char
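
A minimal sketch of how the lookaround pattern might be used from Python. The names DATE_RE and year_of are hypothetical, and re.IGNORECASE is an addition here so that month names such as "July" also match.

```python
import re

# Same pattern as above; group 1 is the divider, group 2 is the year.
DATE_RE = re.compile(
    r"(?<!\S)\d+([\\/. -])(?:\d+|[a-z]+)\1(\d{4})(?!\S)",
    re.IGNORECASE,
)

def year_of(text):
    """Return the year captured in group 2, or None if nothing matches."""
    match = DATE_RE.search(text)
    return match.group(2) if match else None

samples = ["2\\06\\1998", "21.11.1998  ", "18-02-2001",
           "03/05/1999 ", "20 july 1999", "2/06-1998"]
for s in samples:
    print(repr(s), year_of(s))
```

The last sample mixes two dividers, so the backreference fails and the helper returns None. A year-first date such as 2019-08-27 is also not matched by this pattern.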

Upvotes: 0

Ted

Reputation: 1233

Pandas to_datetime is surprisingly good at recognising different date formats. The only problem it will have is with backslashes, but if you can replace them with forward slashes using str.replace, then I think it's easier than using a regex.

import pandas as pd
df = pd.DataFrame({"date": ["2\\06\\1998", "21.11.1998", "18-02-2001", "03/05/1999", "20 july 1999"]})

df["date"] = df["date"].str.replace("\\", "/", regex=False)  # literal backslash, not a regex
df["date"] = pd.to_datetime(df["date"])
df["date"].dt.year

0    1998
1    1998
2    2001
3    1999
4    1999
Name: date, dtype: int64

Upvotes: 0
