String to date, but only month and year

Question

I have a dataset that includes a single column titled DATE. It holds only a year, a dash (-), and a two-digit month, like this: 2002-03 or 2007-11. The values are strings, and converting them with to_datetime set every date to the first day of its month (adding a day I never supplied). I called to_datetime with the format '%Y%m'. Ultimately, I just want to sort this column by year, then by month, and get the average of another column for each year and month. I suppose I could still do this even with the extraneous day, but it doesn't seem like a very clean way to do it. What am I doing wrong?

JuliettVictor · Accepted Answer

Let's say your dataframe looks like this:

import pandas as pd
df = pd.DataFrame({'date':['2021-01','2021-02','2021-03','2021-04']})

Option 1: dates as `pd.Period`

df['date_period'] = pd.to_datetime(df['date'],format='%Y-%m').dt.to_period('M')

You can access years and months via

df['year'] = df['date_period'].dt.year
df['month'] = df['date_period'].dt.month
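To get the per-month average the question asks for, one possibility is to group directly on the period column. This is only a sketch: the `value` column and its numbers are made up for illustration and stand in for whatever other column you want to average.

```python
import pandas as pd

# Hypothetical data: 'value' stands in for the asker's other column.
df = pd.DataFrame({
    'date': ['2021-02', '2021-01', '2021-02', '2021-01'],
    'value': [10.0, 1.0, 20.0, 3.0],
})
df['date_period'] = pd.to_datetime(df['date'], format='%Y-%m').dt.to_period('M')

# groupby sorts the group keys by default, so the result comes back
# ordered by year and then month without a separate sort step.
monthly_mean = df.groupby('date_period')['value'].mean()
print(monthly_mean)
```

Because a monthly `Period` carries no day at all, there is no placeholder "first of the month" to work around.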

Option 2: dates as integer

df['date_int'] = df['date'].str.replace('-','').astype(int)

You can access years and months via

df['year'] = df['date_int'] // 100
df['month'] = df['date_int'] % 100
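With the integer representation, the same average can be computed by grouping on the derived `year` and `month` columns. Again a sketch under assumptions: the `value` column and its contents are invented for the example.

```python
import pandas as pd

# Hypothetical data: 'value' stands in for the asker's other column.
df = pd.DataFrame({
    'date': ['2021-02', '2020-12', '2021-02', '2020-12'],
    'value': [10.0, 1.0, 20.0, 3.0],
})
df['date_int'] = df['date'].str.replace('-', '').astype(int)
df['year'] = df['date_int'] // 100
df['month'] = df['date_int'] % 100

# Grouping on both keys yields a (year, month) MultiIndex,
# sorted by year first and then by month.
monthly_mean = df.groupby(['year', 'month'])['value'].mean()
print(monthly_mean)
```

Grouping on `date_int` alone should give the same ordering, since YYYYMM integers sort chronologically.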

Comparison

The result looks like this:

      date date_period  date_int
0  2021-01     2021-01    202101
1  2021-02     2021-02    202102
2  2021-03     2021-03    202103
3  2021-04     2021-04    202104

The second option is roughly 2.3 times as fast as the first one on this four-row dataframe:

%timeit pd.to_datetime(df['date'],format='%Y-%m').dt.to_period('M')

703 µs ± 78.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df['date'].str.replace('-','').astype(int)

304 µs ± 8.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
