Reputation: 1035
I'm trying to create a dataframe with pandas and drop dates later than say 201702. The dataframe is structured as so
Date Account Number
1 2019-02-21 123841234
2 2017-01-01 193741927
3 2015-03-04 981237432
4 2018-05-29 134913473
5 2012-05-12 138749173
6 2009-01-04 174917239
I'm reading in the csv (data.csv) and attempting to remove any date after 2017-02-28 as so:
data_csv = pd.read_csv('data.csv')
data_csv[data_csv.DATE < '2017-02-28']
Is this supposed to work correctly with a date format of YYYY-MM-DD or is there something I'd have to do to the column format to ensure that these rows are dropped from the dataframe?
Thank you for your time.
Upvotes: 3
Views: 11031
Reputation: 402263
If your date strings are in YYYY-MM-DD format then lexicographical comparisons work out of the box (for python in general, not just pandas).
'2009-01-04' < '2017-02-28'
# True
'2019-01-04' < '2017-02-28'
# False
So your comparison should work out without any changes. Although it'd be safer to convert to datetime first, so your code makes no assumptions while still working.
df.dtypes
Date object
Account Number int64
dtype: object
df[df['Date'] < '2017-02-28']
Date Account Number
2 2017-01-01 193741927
3 2015-03-04 981237432
5 2012-05-12 138749173
6 2009-01-04 174917239
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df.dtypes
Date datetime64[ns]
Account Number int64
dtype: object
df[df['Date'] < '2017-02-28']
Date Account Number
2 2017-01-01 193741927
3 2015-03-04 981237432
5 2012-05-12 138749173
6 2009-01-04 174917239
Upvotes: 3
Reputation: 61910
I suggest you transform the string into a Timestamp
, assuming data_csv.Date
is also a Timestamp
:
result = data_csv[data_csv.Date < pd.to_datetime('2017-02-28')]
print(result)
Output
Date Account Number
1 2017-01-01 193741927
2 2015-03-04 981237432
4 2012-05-12 138749173
5 2009-01-04 174917239
Upvotes: 2