Amie
Amie

Reputation: 103

Converting inconsistently formatted string dates to datetime in pandas

I have a pandas dataframe in which the date information is a string with the month and year:

date = ["JUN 17", "JULY 17", "AUG 18", "NOV 19"]

Note that the month is usually written as the 3 digit abbreviation, but is sometimes written as the full month for June and July.

I would like to convert this into a datetime format which assumes each date is on the first of the month:

date = [06-01-2017, 07-01-2017, 08-01-2018, 11-01-2019]

Edit to provide more information: Two main issues I wasn't sure how to handle:

  1. Month is not in a consistent format. Tried to solve this using by just taking a subset of the first three characters of the string.
  2. Year is last two digits only, was struggling to specify that it is 2020 without it getting very messy I have tried a dozen different things that didn't work, most recent attempt is below:
df['date'] = pd.to_datetime(dict(year = df['Record Month'].astype(str).str[-2:], month = df['Record Month'].astype(str).str[0:3], day=1))

This has the error "Unable to parse string "JUN" at position 0

Upvotes: 1

Views: 881

Answers (2)

Craig
Craig

Reputation: 4855

You were close with using pandas.to_datetime(). Instead of using a dictionary though, you could just reformat the date strings to a more standard format. If you convert each date string into MMMYY format (pretty similar to what you were doing) you can pass the strftime format "%b%y" to to_datetime() and it will convert the strings into dates.

import pandas as pd

date = ["JUN 17", "JULY 17", "AUG 18", "NOV 19"]
df = pd.DataFrame(date, columns=["Record Month"])

df['date'] = pd.to_datetime(df["Record Month"].str[:3] + df["Record Month"].str[-2:], format='%b%y')

print(df)

Produces that following result:

  Record Date       date
0      JUN 17 2017-06-01
1     JULY 17 2017-07-01
2      AUG 18 2018-08-01
3      NOV 19 2019-11-01

Upvotes: 0

RichieV
RichieV

Reputation: 5183

If you are not sure of the many spellings that can show up then a dictionary mapping would not work. Perhaps your best chance is to split and slice so you normalize into year and month columns and then build the date.

If date is a list as in your example.

date = [d.split() for d in date]
df = pd.DataFrame([m[:3].lower, '20' + y] for m, y in date],
# df = pd.DataFrame([[s.split()[0][:3].lower, '20' + s.split()[1]] for s in date],
    columns=['month', 'year'])

Then pass a mapper to series.replace as in

df.month = df.month.replace({'jan': 1, 'feb': 2 ...})

Then parse the dates from its components

# first cap the date to the first day of the month
df['day'] = 1
df = pd.to_datetime(df)

Upvotes: 1

Related Questions