Reputation: 397
I have an excel spreadsheet with some datetime data in a column. I exported the data into a dataframe using pandas. This column however has chunks of dates that have the month and day swapped, while there are other chunks of dates in the same column that are correct. Here's an example -
Figure 1: Day and month have been swapped incorrectly
The above picture shows the day and month swapped. The date shows 2016-01-10
but should instead be 2016-10-01
. Compare this with another chunk of datetime values in the same column -
Figure 2: Day and month are correctly represented
In the above case in Figure 2, the month correctly represented as 12
and the day is 31
.
I used the solution from this question - How to swap months and days in a datetime object?
I also tried using this solution - Python Pandas - Day and Month mix up
I also tried writing my own function to map to the entries but this was to no avail either -
def dm_swap(day, month):
if(month != 10 or month != 11 or month != 12):
temp = day
day = month
month = temp
t2016Q4.start.map(dmswap, t2016Q4.dt.day, t2016Q4.dt.month)
However, both solutions change all the datetime values in the column. So, when the incorrect values get corrected, the correct values become incorrect.
I've also linked the excel file for your convenience. It's an open data set.
Please choose the last dataset Bikeshare Ridership (2016 Q4). The "start" and "end" columns have the above mentioned issues.
Is there a more efficient way to clean the datetime data?
Upvotes: 2
Views: 1561
Reputation: 879749
Andrew observed that the DataFrame can be fixed by flipping all months and days when doing so results in a valid date.
Here's a fast way to "flip" all the dates. Invalid dates are coerced to NaT (Not-A-Timestamp) values and then dropped. The remaining flipped dates can then be reassigned to df
:
import pandas as pd
df = pd.read_excel('2016_Bike_Share_Toronto_Ridership_Q4.xlsx')
for col in ['trip_start_time', 'trip_stop_time']:
df[col] = pd.to_datetime(df[col])
swapped = pd.to_datetime({'year':df[col].dt.year,
'month':df[col].dt.day,
'day':df[col].dt.month,
'hour':df[col].dt.hour,
'minute':df[col].dt.minute,
'second':df[col].dt.second,}, errors='coerce')
swapped = swapped.dropna()
mask = swapped.index
df.loc[mask, col] = swapped
# check that now all dates are in 2016Q4
for col in ['trip_start_time', 'trip_stop_time']:
mask = (pd.PeriodIndex(df[col], freq='Q') == '2016Q4')
assert mask.all()
# check that `trip_start_times` are in chronological order
assert (df['trip_start_time'].diff().dropna() >= pd.Timedelta(0)).all()
# check that `trip_stop_times` are always greater than `trip_start_times`
assert ((df['trip_stop_time']-df['trip_start_time']).dropna() >= pd.Timedelta(0)).all()
The assert statements above verify that the resultant dates are all in 2016Q4, that trip_start_times
are in chronological order, and that the trip_stop_times
are always greater than their associated trip_start_times
.
Upvotes: 0
Reputation: 1082
OK.
EDIT -again. I ran the code below, and it took forever! I aborted in the end, but this definitely works in sensible time as well - good luck!:
import pandas as pd
f = "string\to\file\here.xlsx"
df = pd.read_excel(f)
def alter_date(timestamp):
try:
date_time = timestamp.to_datetime().strftime("%Y-%d-%m %H:%M:%S")
time_stamp = pd.Timestamp(date_time)
return time_stamp
except:
return timestamp
new_starts = df["trip_start_time"].apply(alter_date)
df["trip_start_time"] = new_starts
new_ends = df["trip_stop_time"].apply(alter_date)
df["trip_stop_time"] = new_ends
EDIT: I had a bit of a dig, and it looks possible based on what I've done previously, new code here:
import pandas as pd
f = "string\to\file\here.xlsx"
df = pd.read_excel(f)
for idx in df.index:
trip_start = df.loc[df.index[idx], "trip_start_time"]
trip_end = df.loc[df.index[idx], "trip_stop_time"]
start_dt = trip_start.to_datetime()
end_dt = trip_end.to_datetime()
try:
start_dt_string = start_dt.strftime("%Y-%d-%m %H:%M:%S")
end_dt_string = end_dt.strftime("%Y-%d-%m %H:%M:%S")
start_ts = pd.Timestamp(start_dt_string)
end_ts = pd.Timestamp(end_dt_string)
df.loc[idx, "trip_start_time"] = start_ts
df.loc[idx, "trip_stop_time"] = end_ts
except ValueError:
pass
It's a bit slow (there's a bunch of rows of data) but my computer seems to be working through it - will update again if it fails.
Old response: So, what's happened is that every date/time where there is no possibility of ambiguity is in the original dataset in the format: DD/MM/YYYY HH:MM:SS.
If it's possible to co-erce to MM/DD/YY HH:MM:SS then it has
What I would do, is to iterate over each column
for row in df.index:
try:
new_dt = datetime.strptime(row, "%Y-%d-%m %H:%M:%S")
#write back to the df here
except ValueError:
pass#ignore anything that cannot be converted
Upvotes: 1
Reputation: 595
You can use the parameter format
in pd.to_datetime
:
>>> date= pd.Series(['2016-01-10', '2016-02-10'])
>>> pd.to_datetime(date, format='%Y-%d-%m')
Out:
0 2016-10-01
1 2016-10-02
Upvotes: 0