wawawa
wawawa

Reputation: 3375

DateTime filter in pandas

I have a csv file like this and this is the code I wrote to filter the date

example['date_1'] = pd.to_datetime(example['date_1'])
example['date_2'] = pd.to_datetime(example['date_2'])

example 
date_1        ID      date_2
2015-01-12    111     2016-01-20 08:34:00
2016-01-11    222     2016-12-15 08:34:00
2016-01-11    7770    2016-12-15 08:34:00
2016-01-10    7881    2016-11-17 08:32:00
2016-01-03    90243   2016-04-14 08:35:00
2016-01-03    90354   2016-04-14 08:35:00
2015-01-11    1140303 2015-12-15 08:43:00
2015-01-11    1140414 2015-12-15 08:43:00

example[(example['date_1'] <= '2016-11-01') 
        & (example['date_1'] >= '2015-11-01') 
        & (example['date_2'] <= '2016-12-16') 
        & (example['date_2'] >= '2015-12-15')]
Output:

2016-01-11    222     2016-12-15 08:34:00
2016-01-11    7770    2016-12-15 08:34:00
2016-01-10    7881    2016-11-17 08:32:00
2016-01-03    90243   2016-04-14 08:35:00
2016-01-03    90354   2016-04-14 08:35:00

I don't understand why it changes the format of the date, and it seems like it mix up the month&day in the date, with the conditional filter, the expected result should be the same with the original dataset, but it erased several lines? Can someone help me with it, many thanks.

Upvotes: 0

Views: 70

Answers (1)

Serge Ballesta
Serge Ballesta

Reputation: 149185

Some locales format the date as dd/mm/YYYY, while others use mm/dd/YYYY. By default pandas uses the american format of mm/dd/YYYY unless it can infer the alternate format from the values (when a day number is greater than 12...).

So if you know that you input date format is dd/mm/YYYY, you must say it to pandas:

example['date_1'] = pd.to_datetime(example['date_1'], dayfirst=True)
example['date_2'] = pd.to_datetime(example['date_2'], dayfirst=True)

Once pandas has a Timestamp column, it internally stores a number of nano seconds from 1970-01-01 00:00, and by default displays it according to ISO-8601, striping parts that are 0 for the columns. Parts being the full time, fractions of seconds or nanoseconds.

You should not care if you want to process the Timestamps. If at the end you want to force a format, explicitely change the column to its string representation:

df['date_1'] = df['date_1'].df.strftime('%d/%m/%Y %H:%M')

Upvotes: 1

Related Questions