Reputation: 47
I have a dataset that includes a single column titled DATE. It had only a year, and then a dash (-) and then a month in two digits - like this: 2002-03, or 2007-11. It was a string, but using the to_datetime command resulted in it choosing the first day of every month (adding a day in, extraneously). I used the to_datetime with the format command '%Y%m'. Ultimately, I really just want to sort this column by the year, then the month and get an average from another column for everything in that month and year. I suppose I could still do this, even with the randomly added "day" date, but it doesn't seem like a very clean way to do it. What am I doing wrong?
Upvotes: 1
Views: 7188
Reputation: 644
Let's say your dataframe looks like this
import pandas as pd
df = pd.DataFrame({'date':['2021-01','2021-02','2021-03','2021-04']})
pd.Period
df['date_period'] = pd.to_datetime(df['date'],format='%Y-%m').dt.to_period('M')
You can access years and months via
df['year'] = df['date_period'].dt.year
df['month'] = df['date_period'].dt.month
df['date_int'] = df['date'].str.replace('-','').astype(int)
You can access years and months via
df['year'] = df['date_int'] // 100
df['month'] = df['date_int'] % 100
The result looks like this:
date date_period date_int
0 2021-01 2021-01 202101
1 2021-02 2021-02 202102
2 2021-03 2021-03 202103
3 2021-04 2021-04 202104
The second option is approximately twice as fast as the first one:
%timeit pd.to_datetime(df['date'],format='%Y-%m').dt.to_period('M')
703 µs ± 78.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df['date'].str.replace('-','').astype(int)
304 µs ± 8.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Upvotes: 1
Reputation: 8508
You can use pd.to_datetime and it should be able to convert the data into datetime format.
import pandas as pd
df = pd.DataFrame({'Date':['2020-10','2020-11','2020-12','2020-01','2021-02','2021-03']})
print (df)
df['Date'] = pd.to_datetime(df['Date'])
print (df)
The DataFrame in string format:
Date
0 2020-10
1 2020-11
2 2020-12
3 2020-01
4 2021-02
5 2021-03
The DataFrame in date format:
Date
0 2020-10-01
1 2020-11-01
2 2020-12-01
3 2020-01-01
4 2021-02-01
5 2021-03-01
If you want to sort them by date, you can just do:
df = df.sort_values ('Date')
print (df)
Note here that 4th value is 2020-01-01
and will get sorted to top after the df.sort_values
The output of this will be:
Date
3 2020-01-01
0 2020-10-01
1 2020-11-01
2 2020-12-01
4 2021-02-01
5 2021-03-01
If you want the index to be reset, you can do so with .reset_index(drop=True)
If you want the string value to be converted to Date format but still maintain the YYYY-MM structure, you can use
df['Date'] = pd.to_datetime(df['Date']).dt.strftime('%Y-%m')
However, this will convert the Date
column to String format and not Datetime.
If you want to calculate the mean value of the year-month, then you can simply use the groupby on the Date
column as follows:
import pandas as pd
df = pd.DataFrame({'Date':['2020-10','2020-11','2020-12','2020-01','2021-02',
'2021-03','2020-10','2020-12','2020-11','2021-02'],
'Score':[400,500,300,200,400,300,200,300,400,300]})
df['Date'] = pd.to_datetime(df['Date'],format='%Y-%m') #.dt.strftime('%Y-%m')
#df = df.sort_values ('Date').reset_index(drop=True)
print (df.groupby('Date')['Score'].mean().reset_index().rename(columns={'Score':'mean_score'}))
The output of this will be:
Original DataFrame:
Date Score
0 2020-10 400
1 2020-11 500
2 2020-12 300
3 2020-01 200
4 2021-02 400
5 2021-03 300
6 2020-10 200
7 2020-12 300
8 2020-11 400
9 2021-02 300
Groupby Date and Mean for each month:
Date mean_score
0 2020-01-01 200
1 2020-10-01 300
2 2020-11-01 450
3 2020-12-01 300
4 2021-02-01 350
5 2021-03-01 300
Upvotes: 1