Reputation: 23
I have a dataset in csv format which contains dates in a column. I have imported this dataset into python pandas, and this date column is shown as an object. I need to convert this column to date time but i have a problem. This date column has date format in two formats 1. 11/7/2013 11:51 2. 13-07-2013 08:33:16
I need to convert one format to another one in order to have a standard date format in my python to do analysis. How can i do this?
There are many rows of date in both these formats, so when i try to convert second format to first format using the below code
print(df['date'].apply(lambda x: pd.to_datetime(x, format='%d/%m/%Y %H:%M')))
i get the below error
ValueError: time data '13-07-2013 08:33:16' does not match format '%d/%m/%Y %H:%M' (match)
so what would be the best method to standardise this column in one format?
Upvotes: 1
Views: 4804
Reputation: 1731
In Pandas v1 to_datetime
function is very robust and can handle most date formats. With your example dates it is as easy as calling to_datetime
on your series.
d = ['11/7/2013 11:51', '13-07-2013 08:33:16']
df = pd.DataFrame({'dates': d})
df = pd.to_datetime(df['dates'])
df
output
0 2013-11-07 11:51:00
1 2013-07-13 08:33:16
Name: dates, dtype: datetime64[ns]
Just how robust is to_datetime?
lets tests it using a dataset of 25 different date styles from here.
http = r'https://www.ibm.com/docs/en/cmofz/10.1.0?topic=SSQHWE_10.1.0/com.ibm.ondemand.mp.doc/arsa0257.htm'
table = pd.read_html(http)
df = table[0]
df
# test which datestyles pandas can convert
df['Example_clean'] = pd.to_datetime(df['Example'])
print(df.dtypes)
df
# yes it converted all 25 different date formats!!
# Note - when using a time it automatically uses todays date.
Upvotes: 0
Reputation: 13437
You can try the following
import pandas as pd
import numpy as np
n=1000
ch = ['13-07-2013 08:33:16', '13/07/2013 08:33:16']
df = pd.DataFrame({"date": np.random.choice(ch,n)})
df["date"] = df["date"].str.replace("/","-").astype("M8[us]")
Update
I just realize that the format I'm using is not the same you would like. I strongly suggest you to use a standard format as YYYY-MM-DD
and datetime as type instead of string. There are a lot of post that explains why this is more efficient on RAM and then faster.
A small comparative for a df with just 1000 rows
%%timeit
out = df["date"].str.replace("/","-").astype("M8[us]")
146 ms ± 5.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
out = df["date"].apply(lambda x: pd.to_datetime(x)\
.strftime('%d/%m/%Y %H:%M'))
396 ms ± 22.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
out = df['date'].apply(lambda x: pd.to_datetime(x,
format='%d/%m/%Y %H:%M',
infer_datetime_format= True))
425 ms ± 4.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 0
Reputation: 17007
its better to use strftime():
df = pd.DataFrame({'Date': ['11/7/2013 11:51','13-07-2013 08:33:16']})
df['Clean_Date'] = df.Date.apply(lambda x: pd.to_datetime(x).strftime('%d/%m/%Y %H:%M'))
print(df)
output:
Date Clean_Date
0 11/7/2013 11:51 07/11/2013 11:51
1 13-07-2013 08:33:16 13/07/2013 08:33
Upvotes: 1
Reputation: 2776
Try removing the format
parameter and setting infer_datetime_format=True
in the arguments you pass to pd.to_datetime
Upvotes: 1