rachelvsamuel
rachelvsamuel

Reputation: 1721

Translating a messy date string in a DataFrame to `datetime` in python and pandas

I have a dataframe that lists the date in several formats. For example:

       date
201    July 17,2019
555    06-06-2019
295    28/3/2019

Is there a way to translate all of those dates to a datetime?

Upvotes: 4

Views: 1489

Answers (2)

felipe
felipe

Reputation: 8035

Check Edit below for better answer.

Normally for a task like this you would use pandas.Series.dt.strftime() to convert the entire column to a specific DateTime format, but since it seems you have different formats in different rows you should probably instead look intodatetime.strptime().

Traverse through that column of Date an through if (or try) statement's changes the dates (pseudocode for demonstration):

for date in date_column:
    if "," in date:
        # strptime()
    elif "-" in date:
        # strptime()
    elif "/" in date:
        dt = datetime.strptime(date, "%d/%m/%y")

    column_row = dt

Here is the strftime() format codes, which you will need to set the second param of the function.


Edit

You may also do this by utilizing dateutil.parser.parse (make sure to install it first via pip install python-dateutil):

from dateutil.parser import parse

df['date'] = df.date.apply(parse)

So for example:

from dateutil.parser import parse
import pandas as pd
import numpy as np


def parse_date(date):
    if date is np.nan:
        return np.nan
    else:
        return parse(date)


df = pd.DataFrame(
    {
        "date": [
            "July 17,2019",
            "06-06-2019",
            "28/3/2019",
            "12/3/19",
            "31-12-19",
            np.nan,
        ],
        "text": ["t1", "t2", "t3", "t4", "t5", "t6"],
    }
)

df["date"] = df.date.apply(parse_date)

Which converts:

           date text
0  July 17,2019   t1
1    06-06-2019   t2
2     28/3/2019   t3
3       12/3/19   t4
4      31-12-19   t5
5           NaN   t6

to

        date text
0 2019-07-17   t1
1 2019-06-06   t2
2 2019-03-28   t3
3 2019-12-03   t4
4 2019-12-31   t5
5        NaT   t6

You can also accomplish the same as the function by using a lambda expression as following:

df["date"] = df.date.apply(lambda date: np.nan if date is np.nan else parse(date))

Upvotes: 3

oppressionslayer
oppressionslayer

Reputation: 7224

I would create a an apply function that checks for a regex, and returns the datetime based on that regex, running through a list of if/elif commands and returning the result, something like:

def checkdate(x):
   if pd.Series(x).str.contains(r'\w+ \d+,\d+')[0]: 
     return datetime.datetime.strptime(x, '%B %d,%Y') 
   #elif next str.contains, etc
   else:
     return np.nan

df.date.apply(checkdate) 

sample out

0   2019-07-17
1          NaT
2          NaT

and then you can reset the date column to those values with df = df.date.apply(checkdate)

Upvotes: 1

Related Questions