KMW
KMW

Reputation: 47

String to date, but only month and year

I have a dataset that includes a single column titled DATE. It had only a year, and then a dash (-) and then a month in two digits - like this: 2002-03, or 2007-11. It was a string, but using the to_datetime command resulted in it choosing the first day of every month (adding a day in, extraneously). I used the to_datetime with the format command '%Y%m'. Ultimately, I really just want to sort this column by the year, then the month and get an average from another column for everything in that month and year. I suppose I could still do this, even with the randomly added "day" date, but it doesn't seem like a very clean way to do it. What am I doing wrong?

Upvotes: 1

Views: 7188

Answers (2)

JuliettVictor
JuliettVictor

Reputation: 644

Let's say your dataframe looks like this

import pandas as pd
df = pd.DataFrame({'date':['2021-01','2021-02','2021-03','2021-04']})

Option 1: dates as pd.Period

df['date_period'] = pd.to_datetime(df['date'],format='%Y-%m').dt.to_period('M')

You can access years and months via

df['year'] = df['date_period'].dt.year
df['month'] = df['date_period'].dt.month

Option 2: dates as integer

df['date_int'] = df['date'].str.replace('-','').astype(int)

You can access years and months via

df['year'] = df['date_int'] // 100
df['month'] = df['date_int'] % 100

Comparison

The result looks like this:

      date date_period  date_int
0  2021-01     2021-01    202101
1  2021-02     2021-02    202102
2  2021-03     2021-03    202103
3  2021-04     2021-04    202104

The second option is approximately twice as fast as the first one:

%timeit pd.to_datetime(df['date'],format='%Y-%m').dt.to_period('M')

703 µs ± 78.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df['date'].str.replace('-','').astype(int)

304 µs ± 8.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Upvotes: 1

Joe Ferndz
Joe Ferndz

Reputation: 8508

You can use pd.to_datetime and it should be able to convert the data into datetime format.

import pandas as pd
df = pd.DataFrame({'Date':['2020-10','2020-11','2020-12','2020-01','2021-02','2021-03']})
print (df)

df['Date'] = pd.to_datetime(df['Date'])
print (df)

The DataFrame in string format:

      Date
0  2020-10
1  2020-11
2  2020-12
3  2020-01
4  2021-02
5  2021-03

The DataFrame in date format:

        Date
0 2020-10-01
1 2020-11-01
2 2020-12-01
3 2020-01-01
4 2021-02-01
5 2021-03-01

If you want to sort them by date, you can just do:

df = df.sort_values ('Date')
print (df)

Note here that 4th value is 2020-01-01 and will get sorted to top after the df.sort_values

The output of this will be:

        Date
3 2020-01-01
0 2020-10-01
1 2020-11-01
2 2020-12-01
4 2021-02-01
5 2021-03-01

If you want the index to be reset, you can do so with .reset_index(drop=True)

If you want the string value to be converted to Date format but still maintain the YYYY-MM structure, you can use

df['Date'] = pd.to_datetime(df['Date']).dt.strftime('%Y-%m')

However, this will convert the Date column to String format and not Datetime.

If you want to calculate the mean value of the year-month, then you can simply use the groupby on the Date column as follows:

import pandas as pd
df = pd.DataFrame({'Date':['2020-10','2020-11','2020-12','2020-01','2021-02',
                           '2021-03','2020-10','2020-12','2020-11','2021-02'],
                  'Score':[400,500,300,200,400,300,200,300,400,300]})

df['Date'] = pd.to_datetime(df['Date'],format='%Y-%m') #.dt.strftime('%Y-%m')

#df = df.sort_values ('Date').reset_index(drop=True)

print (df.groupby('Date')['Score'].mean().reset_index().rename(columns={'Score':'mean_score'}))

The output of this will be:

Original DataFrame:

      Date  Score
0  2020-10    400
1  2020-11    500
2  2020-12    300
3  2020-01    200
4  2021-02    400
5  2021-03    300
6  2020-10    200
7  2020-12    300
8  2020-11    400
9  2021-02    300

Groupby Date and Mean for each month:

        Date  mean_score
0 2020-01-01         200
1 2020-10-01         300
2 2020-11-01         450
3 2020-12-01         300
4 2021-02-01         350
5 2021-03-01         300

Upvotes: 1

Related Questions