working with dates in pandas - remove unseen characters in datetime and convert to string

Question

I am using pandas to import data dfST = read_csv( ... , parse_dates={'timestamp':[date]}) In my csv, date is in the format YYY/MM/DD, which is all I need - there is no time. I have several data sets that I need to compare for membership. When I convert theses 'timestamp' to a string, sometimes I get something like this:

'1977-07-31T00:00:00.000000000Z'

which I understand is a datetime including milliseconds and a timezone. Is there any way to suppress the addition of the extraneous time on import? If not, I need to exclude it somehow.

dfST.timestamp[1]
Out[138]: Timestamp('1977-07-31 00:00:00')

I have tried formatting it, which seemed to work until I called the formatted values:

dfSTdate=pd.to_datetime(dfST.timestamp, format="%Y-%m-%d")  
dfSTdate.head()
Out[123]: 
0   1977-07-31
1   1977-07-31
Name: timestamp, dtype: datetime64[ns]

But no... when I test the value of this I also get the time:

dfSTdate[1]
Out[124]: Timestamp('1977-07-31 00:00:00')

When I convert this to an array, the time is included with the millisecond and the timezone, which really messes my comparisons up.

test97=np.array(dfSTdate)
test97[1]
Out[136]: numpy.datetime64('1977-07-30T20:00:00.000000000-0400')

How can I get rid of the time?!? Ultimately I wish to compare membership among data sets using numpy.in1d with date as a string ('YYYY-MM-DD') as one part of the comparison

joris · Accepted Answer

This is due to the way datetime values are stored in pandas: using the numpy datetime64[ns] dtype. So datetime values are always stored at nanosecond resolution. Even if you only have a date, this will be converted to a timestamp with a zero time of nanosecond resolution. This is just due to the implementation in pandas.

The issues you have with printing the values and having unexpected results, is just because how these objects are printed in the python console (their representation), not their actual value.
If you print a single values, you get a the Timestamp representation of pandas:

Timestamp('1977-07-31 00:00:00')

So you get the seconds here as well, just because this is the default representation.
If you convert it to an array, and then print it, you get the standard numpy representation:

numpy.datetime64('1977-07-30T20:00:00.000000000-0400')

This is indeed a very misleading representation. Because numpy will, just for printing it in the console, convert it to your local timezone. But this doesn't change your actual value, it's just weird printing.

That is the background, now to answer your question, how do I get rid of the time?
That depends on your goal. Do you really want to convert it to a string? Or do you just don't like the repr?

if you just want to work with the datetime values, you don't need to get rid of it.
if you want to convert it to strings, you can apply strfitme (df['timestamp'].apply(lambda x: x.strftime('%Y-%m-%d'))). Or if it is to write it as strings to csv, use the date_format keyword in to_csv
if you really want a 'date', you can use the datetime.date type (standard python type) in a DataFrame column. You can convert your existing column to this with with: pd.DatetimeIndex(dfST['timestamp']).date. But personally I don't think this has many advantages.

working with dates in pandas - remove unseen characters in datetime and convert to string

Answers (1)

Related Questions