NinjaGaiden
NinjaGaiden

Reputation: 3146

Pandas datetime formatting

Is it possible to represent a pd.to_datetime with suffix of zeros? It seems the zeros are being removed.

print pd.to_datetime("2000-07-26 14:21:00.00000",
                format="%Y-%m-%d %H:%M:%S.%f")

The result is

2000-07-26 14:21:00

The desired result would be

2000-07-26 14:21:00.00000

I know the values mean the same thing but it would be nice for consistency.

Upvotes: 3

Views: 4938

Answers (2)

user4322779
user4322779

Reputation:

Doing some testing shows that when formatting date-time data with format="%H:%M:%S.%f", %f is capable of nanosecond resolution provided the ninth digit after the decimal place is non-zero. When formatting a string, a variable number of trailing zeros from none to five are added depending on the position of the least significant digit after the decimal point and given that its also the final digit. Here's a table of that from test data where position is the position of the least significant non-zero and also final digit and zeros is the number of trailing zeros added by formatting:

    position zeros
       9      0
       8      1
       7      2
       6      0
       5      1
       4      2
       3      3
       2      4
       1      5

When a column is formatted with "%H:%M:%S.%f" as a whole, all of its elements will have the same number of digits after the decimal point, which may be done by adding or removing trailing zeros even if that increases or decreases the resolution of raw data. I guess reasons for this are consistency and pleasing aesthetics without usually introducing excessive error, since in numeric calculations trailing zeros don't typically affect immediate results, however they can affect estimation of their error and how they should be presented (Trailing Zeros, Rules for Significant Figures).

Below are some observations of applying "%H:%M:%S.%f" format to individual strings and pandas.Series (DataFrame columns) with pandas.to_datetime and applying pandas.DataFrame.convert_objects(convert_dates='coerce') to DataFrames with a column that could be converted to datetime.

On a string pandas preserves a non-zero digit in up to the ninth decimal place in time conversion with "%H:%M:%S.%f" and adds a date if a one was not provided:

import pandas as pd
pd.to_datetime ("10:00:00.000000001",format="%H:%M:%S.%f")
Out[15]: Timestamp('1900-01-01 10:00:00.000000001')

pd.to_datetime ("2015-09-17 10:00:00.000000001",format="%Y-%m-%d %H:%M:%S.%f")
Out[15]: Timestamp('2015-09-17 10:00:00.000000001')

Prior to that and for tests in which the final non-zero digit is the final digit, it adds up to five trailing zeros after the final non-zero digit increasing the resolution of the raw data except when the final non-zero digit is in position six to the right of the decimal place:

pd.to_datetime ("10:00:00.00000001",format="%H:%M:%S.%f")
Out[15]: Timestamp('1900-01-01 10:00:00.000000010')

pd.to_datetime ("2015-09-17 10:00:00.00000001",format="%Y-%m-%d %H:%M:%S.%f")
Out[16]: Timestamp('2015-09-17 10:00:00.000000010')

pd.to_datetime ("10:00:00.0000001",format="%H:%M:%S.%f")
Out[15]: Timestamp('1900-01-01 10:00:00.000000100')

pd.to_datetime ("2015-09-17 10:00:00.0000001",format="%Y-%m-%d %H:%M:%S.%f")
Out[17]: Timestamp('2015-09-17 10:00:00.000000100')

pd.to_datetime ("10:00:00.000001",format="%H:%M:%S.%f")
Out[33]: Timestamp('1900-01-01 10:00:00.000001')

pd.to_datetime ("2015-09-17 10:00:00.000001",format="%Y-%m-%d %H:%M:%S.%f")
Out[18]: Timestamp('2015-09-17 10:00:00.000001')

pd.to_datetime ("10:00:00.00001",format="%H:%M:%S.%f")
Out[6]: Timestamp('1900-01-01 10:00:00.000010')

pd.to_datetime ("2015-09-17 10:00:00.00001",format="%Y-%m-%d %H:%M:%S.%f")
Out[19]: Timestamp('2015-09-17 10:00:00.000010')

pd.to_datetime ("10:00:00.0001",format="%H:%M:%S.%f")
Out[9]: Timestamp('1900-01-01 10:00:00.000100')

pd.to_datetime ("2015-09-17 10:00:00.0001",format="%Y-%m-%d %H:%M:%S.%f")
Out[21]: Timestamp('2015-09-17 10:00:00.000100')

pd.to_datetime ("10:00:00.001",format="%H:%M:%S.%f")
Out[10]: Timestamp('1900-01-01 10:00:00.001000')

pd.to_datetime ("2015-09-17 10:00:00.001",format="%Y-%m-%d %H:%M:%S.%f")
Out[22]: Timestamp('2015-09-17 10:00:00.001000')

pd.to_datetime ("10:00:00.01",format="%H:%M:%S.%f")
Out[12]: Timestamp('1900-01-01 10:00:00.010000')

pd.to_datetime ("2015-09-17 10:00:00.01",format="%Y-%m-%d %H:%M:%S.%f")
Out[24]: Timestamp('2015-09-17 10:00:00.010000'

pd.to_datetime ("10:00:00.1",format="%H:%M:%S.%f")
Out[13]: Timestamp('1900-01-01 10:00:00.100000')

pd.to_datetime ("2015-09-17 10:00:00.1",format="%Y-%m-%d %H:%M:%S.%f")
Out[26]: Timestamp('2015-09-17 10:00:00.100000')

Let see how it works with a DataFrame:

!type test.csv # here type is Windows substitute for Linux cat command
date,mesg
10:00:00.000000001,one
10:00:00.00000001,two
10:00:00.0000001,three
10:00:00.000001,four
10:00:00.00001,five
10:00:00.0001,six
10:00:00.001,seven
10:00:00.01,eight
10:00:00.1,nine
10:00:00.000000001,ten
10:00:00.000000002,eleven
10:00:00.000000003,twelve

df = pd.read_csv('test.csv')
df
Out[30]: 
                  date    mesg
0   10:00:00.000000001     one
1    10:00:00.00000001     two
2     10:00:00.0000001   three
3      10:00:00.000001    four
4       10:00:00.00001    five
5        10:00:00.0001     six
6         10:00:00.001   seven
7          10:00:00.01   eight
8           10:00:00.1    nine
9   10:00:00.000000001     ten
10  10:00:00.000000002  eleven
11  10:00:00.000000003  twelve

df.dtypes
Out[31]: 
date    object
mesg    object
dtype: object

Datetime conversion of a DataFrame with convert_objects, which does not have format option, provides microsecond resolution even when some raw data has resolution less or more than that and adds today's date:

df2 = df.convert_objects(convert_dates='coerce')
df2
Out[32]: 
                     date    mesg
0  2015-09-17 10:00:00.000000     one
1  2015-09-17 10:00:00.000000     two
2  2015-09-17 10:00:00.000000   three
3  2015-09-17 10:00:00.000001    four
4  2015-09-17 10:00:00.000010    five
5  2015-09-17 10:00:00.000100     six
6  2015-09-17 10:00:00.001000   seven
7  2015-09-17 10:00:00.010000   eight
8  2015-09-17 10:00:00.100000    nine
9  2015-09-17 10:00:00.000000     ten
10 2015-09-17 10:00:00.000000  eleven
11 2015-09-17 10:00:00.000000  twelve

df2.dtypes
Out[33]: 
date    datetime64[ns]
mesg            object
dtype: object

Greater resolution of element values in a DataFrame column created from raw data some of which has greater than microsecond resolution is not recoverable with "%H:%M:%S.%f" formatting after datetime conversion done without an explicit format specifier (that is with DataFrame.convert_objects):

df2['date'] = pd.to_datetime(df2['date'],format="%H:%M:%S.%f")
df2
Out[34]: 
                         date    mesg
0  2015-09-17 10:00:00.000000     one
1  2015-09-17 10:00:00.000000     two
2  2015-09-17 10:00:00.000000   three
3  2015-09-17 10:00:00.000001    four
4  2015-09-17 10:00:00.000010    five
5  2015-09-17 10:00:00.000100     six
6  2015-09-17 10:00:00.001000   seven
7  2015-09-17 10:00:00.010000   eight
8  2015-09-17 10:00:00.100000    nine
9  2015-09-17 10:00:00.000000     ten
10 2015-09-17 10:00:00.000000  eleven
11 2015-09-17 10:00:00.000000  twelve

Formatting a DataFrame colume with "%H:%M:%S.%f" before datetime conversion provides nanosecond resolution if at least one element has a non-zero digit in the ninth place (as advertised in pandas.to_datetime documentation), but also increases the resolution of raw data with less than nanosecond resolution to that level and adds 1900-01-01 as the date:

df3 = df.copy(deep=True)
df3['date'] = pd.to_datetime(df3['date'],format="%H:%M:%S.%f",coerce=True)
df3
Out[35]:
                            date    mesg
0  1900-01-01 10:00:00.000000001     one
1  1900-01-01 10:00:00.000000010     two
2  1900-01-01 10:00:00.000000100   three
3  1900-01-01 10:00:00.000001000    four
4  1900-01-01 10:00:00.000010000    five
5  1900-01-01 10:00:00.000100000     six
6  1900-01-01 10:00:00.001000000   seven
7  1900-01-01 10:00:00.010000000   eight
8  1900-01-01 10:00:00.100000000    nine
9  1900-01-01 10:00:00.000000001     ten
10 1900-01-01 10:00:00.000000002  eleven
11 1900-01-01 10:00:00.000000003  twelve

Formatting a DataFrame column with "%H:%M:%S.%f" adds zeros after the datum with the least significant non-zero digit after decimal point (over the whole column and zeros are added according to the position:zeros table above) and aligns the resolution of all other data with that even if doing so increases or decreases the resolution of some raw data:

df4 = pd.read_csv('test2.csv')
df4
Out[36]: 
                  date    mesg
0   10:00:00.000000000     one
1    10:00:00.00000000     two
2     10:00:00.0000000   three
3      10:00:00.000000    four
4       10:00:00.00000    five
5        10:00:00.0001     six
6          10:00:00.00   seven
7           10:00:00.0   eight
8            10:00:00.    nine
9   10:00:00.000000000     ten
10  10:00:00.000000000  eleven
11   10:00:00.00000000  twelve

df4['date'] = pd.to_datetime(df4['date'],format="%H:%M:%S.%f",coerce=True)
df4
Out[37]: 
                         date    mesg
0  1900-01-01 10:00:00.000000     one
1  1900-01-01 10:00:00.000000     two
2  1900-01-01 10:00:00.000000   three
3  1900-01-01 10:00:00.000000    four
4  1900-01-01 10:00:00.000000    five
5  1900-01-01 10:00:00.000100     six
6  1900-01-01 10:00:00.000000   seven
7  1900-01-01 10:00:00.000000   eight
8                         NaT    nine # nothing after decimal point in raw data
9  1900-01-01 10:00:00.000000     ten
10 1900-01-01 10:00:00.000000  eleven
11 1900-01-01 10:00:00.000000  twelve

When attempting this with the same DataFrame with but with dates included in the date column, the same thing happened:

df25
Out[38]: 
                             date    mesg
0   2015-09-10 10:00:00.000000000     one
1    2015-09-11 10:00:00.00000000     two
2     2015-09-12 10:00:00.0000000   three
3      2015-09-13 10:00:00.000000    four
4       2015-09-14 10:00:00.00000    five
5        2015-09-15 10:00:00.0001     six
6          2015-09-16 10:00:00.00   seven
7           2015-09-17 10:00:00.0   eight
8            2015-09-18 10:00:00.    nine
9   2015-09-19 10:00:00.000000000     ten
10  2015-09-20 10:00:00.000000000  eleven
11   2015-09-21 10:00:00.00000000  twelve

df25['date'] = pd.to_datetime(df25['date'],format="%Y-%m-%d %H:%M:%S.%f",coerce=True)
df25
Out[39]: 
                         date    mesg
0  2015-09-10 10:00:00.000000     one
1  2015-09-11 10:00:00.000000     two
2  2015-09-12 10:00:00.000000   three
3  2015-09-13 10:00:00.000000    four
4  2015-09-14 10:00:00.000000    five
5  2015-09-15 10:00:00.000100     six
6  2015-09-16 10:00:00.000000   seven
7  2015-09-17 10:00:00.000000   eight
8                         NaT    nine # nothing after decimal point in raw data
9  2015-09-19 10:00:00.000000     ten
10 2015-09-20 10:00:00.000000  eleven
11 2015-09-21 10:00:00.000000  twelve

When no raw datum has a non-zero significant digit after the decimal point, formatting with a DataFrame column "%H:%M:%S.%f" may uniformly provide just two zeros after the decimal point for all the data even when that increases or decreases the resolution of some raw data:

df5 = pd.read_csv('test3.csv')
df5
Out[40]: 
                  date    mesg
0         10:00:00.000     one
1           10:00:00.0     two
2         10:00:00.000   three
3         10:00:00.000    four
4          10:00:00.00    five
5         10:00:00.000     six
6          10:00:00.00   seven
7           10:00:00.0   eight
8           10:00:00.0    nine
9   10:00:00.000000000     ten
10        10:00:00.000  eleven
11        10:00:00.000  twelve

df5['date'] = pd.to_datetime(df5['date'],format="%H:%M:%S.%f",coerce=True)
df5
Out[41]: 
                  date    mesg
0  1900-01-01 10:00:00     one
1  1900-01-01 10:00:00     two
2  1900-01-01 10:00:00   three
3  1900-01-01 10:00:00    four
4  1900-01-01 10:00:00    five
5  1900-01-01 10:00:00     six
6  1900-01-01 10:00:00   seven
7  1900-01-01 10:00:00   eight
8  1900-01-01 10:00:00    nine
9  1900-01-01 10:00:00     ten
10 1900-01-01 10:00:00  eleven
11 1900-01-01 10:00:00  twelve

The same thing happened when doing this test with the same DataFrame but with dates included in the date column:

df45
Out[42]: 
                             date    mesg
0         2015-09-10 10:00:00.000     one
1           2015-09-11 10:00:00.0     two
2         2015-09-12 10:00:00.000   three
3         2015-09-13 10:00:00.000    four
4          2015-09-14 10:00:00.00    five
5         2015-09-15 10:00:00.000     six
6          2015-09-16 10:00:00.00   seven
7           2015-09-17 10:00:00.0   eight
8           2015-09-18 10:00:00.0    nine
9   2015-09-19 10:00:00.000000000     ten
10        2015-09-20 10:00:00.000  eleven
11        2015-09-21 10:00:00.000  twelve

df45['date'] = pd.to_datetime(df45['date'],format="%Y-%m-%d %H:%M:    %S.%f",coerce=True)
df45
Out[43]: 
                  date    mesg
0  2015-09-10 10:00:00     one
1  2015-09-11 10:00:00     two
2  2015-09-12 10:00:00   three
3  2015-09-13 10:00:00    four
4  2015-09-14 10:00:00    five
5  2015-09-15 10:00:00     six
6  2015-09-16 10:00:00   seven
7  2015-09-17 10:00:00   eight
8  2015-09-18 10:00:00    nine
9  2015-09-19 10:00:00     ten
10 2015-09-20 10:00:00  eleven
11 2015-09-21 10:00:00  twelve

Upvotes: 3

PaulDong
PaulDong

Reputation: 813

Sorry not enough rep to comment, so I shall just attempt my answer here. Completely agree with EdChum, it is a display issue. If you try:

pd.to_datetime ("10:00:00.00001",format="%H:%M:%S.%f")

The response should be:

Timestamp('1900-01-01 10:00:00.000010')

Upvotes: 0

Related Questions