Reputation: 3146
Is it possible to represent a pd.to_datetime with suffix of zeros? It seems the zeros are being removed.
print pd.to_datetime("2000-07-26 14:21:00.00000",
format="%Y-%m-%d %H:%M:%S.%f")
The result is
2000-07-26 14:21:00
The desired result would be
2000-07-26 14:21:00.00000
I know the values mean the same thing but it would be nice for consistency.
Upvotes: 3
Views: 4938
Reputation:
Doing some testing shows that when formatting date-time data with format="%H:%M:%S.%f", %f is capable of nanosecond resolution provided the ninth digit after the decimal place is non-zero. When formatting a string, a variable number of trailing zeros from none to five are added depending on the position of the least significant digit after the decimal point and given that its also the final digit. Here's a table of that from test data where position is the position of the least significant non-zero and also final digit and zeros is the number of trailing zeros added by formatting:
position zeros
9 0
8 1
7 2
6 0
5 1
4 2
3 3
2 4
1 5
When a column is formatted with "%H:%M:%S.%f" as a whole, all of its elements will have the same number of digits after the decimal point, which may be done by adding or removing trailing zeros even if that increases or decreases the resolution of raw data. I guess reasons for this are consistency and pleasing aesthetics without usually introducing excessive error, since in numeric calculations trailing zeros don't typically affect immediate results, however they can affect estimation of their error and how they should be presented (Trailing Zeros, Rules for Significant Figures).
Below are some observations of applying "%H:%M:%S.%f" format to individual strings and pandas.Series (DataFrame columns) with pandas.to_datetime and applying pandas.DataFrame.convert_objects(convert_dates='coerce') to DataFrames with a column that could be converted to datetime.
On a string pandas preserves a non-zero digit in up to the ninth decimal place in time conversion with "%H:%M:%S.%f" and adds a date if a one was not provided:
import pandas as pd
pd.to_datetime ("10:00:00.000000001",format="%H:%M:%S.%f")
Out[15]: Timestamp('1900-01-01 10:00:00.000000001')
pd.to_datetime ("2015-09-17 10:00:00.000000001",format="%Y-%m-%d %H:%M:%S.%f")
Out[15]: Timestamp('2015-09-17 10:00:00.000000001')
Prior to that and for tests in which the final non-zero digit is the final digit, it adds up to five trailing zeros after the final non-zero digit increasing the resolution of the raw data except when the final non-zero digit is in position six to the right of the decimal place:
pd.to_datetime ("10:00:00.00000001",format="%H:%M:%S.%f")
Out[15]: Timestamp('1900-01-01 10:00:00.000000010')
pd.to_datetime ("2015-09-17 10:00:00.00000001",format="%Y-%m-%d %H:%M:%S.%f")
Out[16]: Timestamp('2015-09-17 10:00:00.000000010')
pd.to_datetime ("10:00:00.0000001",format="%H:%M:%S.%f")
Out[15]: Timestamp('1900-01-01 10:00:00.000000100')
pd.to_datetime ("2015-09-17 10:00:00.0000001",format="%Y-%m-%d %H:%M:%S.%f")
Out[17]: Timestamp('2015-09-17 10:00:00.000000100')
pd.to_datetime ("10:00:00.000001",format="%H:%M:%S.%f")
Out[33]: Timestamp('1900-01-01 10:00:00.000001')
pd.to_datetime ("2015-09-17 10:00:00.000001",format="%Y-%m-%d %H:%M:%S.%f")
Out[18]: Timestamp('2015-09-17 10:00:00.000001')
pd.to_datetime ("10:00:00.00001",format="%H:%M:%S.%f")
Out[6]: Timestamp('1900-01-01 10:00:00.000010')
pd.to_datetime ("2015-09-17 10:00:00.00001",format="%Y-%m-%d %H:%M:%S.%f")
Out[19]: Timestamp('2015-09-17 10:00:00.000010')
pd.to_datetime ("10:00:00.0001",format="%H:%M:%S.%f")
Out[9]: Timestamp('1900-01-01 10:00:00.000100')
pd.to_datetime ("2015-09-17 10:00:00.0001",format="%Y-%m-%d %H:%M:%S.%f")
Out[21]: Timestamp('2015-09-17 10:00:00.000100')
pd.to_datetime ("10:00:00.001",format="%H:%M:%S.%f")
Out[10]: Timestamp('1900-01-01 10:00:00.001000')
pd.to_datetime ("2015-09-17 10:00:00.001",format="%Y-%m-%d %H:%M:%S.%f")
Out[22]: Timestamp('2015-09-17 10:00:00.001000')
pd.to_datetime ("10:00:00.01",format="%H:%M:%S.%f")
Out[12]: Timestamp('1900-01-01 10:00:00.010000')
pd.to_datetime ("2015-09-17 10:00:00.01",format="%Y-%m-%d %H:%M:%S.%f")
Out[24]: Timestamp('2015-09-17 10:00:00.010000'
pd.to_datetime ("10:00:00.1",format="%H:%M:%S.%f")
Out[13]: Timestamp('1900-01-01 10:00:00.100000')
pd.to_datetime ("2015-09-17 10:00:00.1",format="%Y-%m-%d %H:%M:%S.%f")
Out[26]: Timestamp('2015-09-17 10:00:00.100000')
Let see how it works with a DataFrame:
!type test.csv # here type is Windows substitute for Linux cat command
date,mesg
10:00:00.000000001,one
10:00:00.00000001,two
10:00:00.0000001,three
10:00:00.000001,four
10:00:00.00001,five
10:00:00.0001,six
10:00:00.001,seven
10:00:00.01,eight
10:00:00.1,nine
10:00:00.000000001,ten
10:00:00.000000002,eleven
10:00:00.000000003,twelve
df = pd.read_csv('test.csv')
df
Out[30]:
date mesg
0 10:00:00.000000001 one
1 10:00:00.00000001 two
2 10:00:00.0000001 three
3 10:00:00.000001 four
4 10:00:00.00001 five
5 10:00:00.0001 six
6 10:00:00.001 seven
7 10:00:00.01 eight
8 10:00:00.1 nine
9 10:00:00.000000001 ten
10 10:00:00.000000002 eleven
11 10:00:00.000000003 twelve
df.dtypes
Out[31]:
date object
mesg object
dtype: object
Datetime conversion of a DataFrame with convert_objects, which does not have format option, provides microsecond resolution even when some raw data has resolution less or more than that and adds today's date:
df2 = df.convert_objects(convert_dates='coerce')
df2
Out[32]:
date mesg
0 2015-09-17 10:00:00.000000 one
1 2015-09-17 10:00:00.000000 two
2 2015-09-17 10:00:00.000000 three
3 2015-09-17 10:00:00.000001 four
4 2015-09-17 10:00:00.000010 five
5 2015-09-17 10:00:00.000100 six
6 2015-09-17 10:00:00.001000 seven
7 2015-09-17 10:00:00.010000 eight
8 2015-09-17 10:00:00.100000 nine
9 2015-09-17 10:00:00.000000 ten
10 2015-09-17 10:00:00.000000 eleven
11 2015-09-17 10:00:00.000000 twelve
df2.dtypes
Out[33]:
date datetime64[ns]
mesg object
dtype: object
Greater resolution of element values in a DataFrame column created from raw data some of which has greater than microsecond resolution is not recoverable with "%H:%M:%S.%f" formatting after datetime conversion done without an explicit format specifier (that is with DataFrame.convert_objects):
df2['date'] = pd.to_datetime(df2['date'],format="%H:%M:%S.%f")
df2
Out[34]:
date mesg
0 2015-09-17 10:00:00.000000 one
1 2015-09-17 10:00:00.000000 two
2 2015-09-17 10:00:00.000000 three
3 2015-09-17 10:00:00.000001 four
4 2015-09-17 10:00:00.000010 five
5 2015-09-17 10:00:00.000100 six
6 2015-09-17 10:00:00.001000 seven
7 2015-09-17 10:00:00.010000 eight
8 2015-09-17 10:00:00.100000 nine
9 2015-09-17 10:00:00.000000 ten
10 2015-09-17 10:00:00.000000 eleven
11 2015-09-17 10:00:00.000000 twelve
Formatting a DataFrame colume with "%H:%M:%S.%f" before datetime conversion provides nanosecond resolution if at least one element has a non-zero digit in the ninth place (as advertised in pandas.to_datetime documentation), but also increases the resolution of raw data with less than nanosecond resolution to that level and adds 1900-01-01 as the date:
df3 = df.copy(deep=True)
df3['date'] = pd.to_datetime(df3['date'],format="%H:%M:%S.%f",coerce=True)
df3
Out[35]:
date mesg
0 1900-01-01 10:00:00.000000001 one
1 1900-01-01 10:00:00.000000010 two
2 1900-01-01 10:00:00.000000100 three
3 1900-01-01 10:00:00.000001000 four
4 1900-01-01 10:00:00.000010000 five
5 1900-01-01 10:00:00.000100000 six
6 1900-01-01 10:00:00.001000000 seven
7 1900-01-01 10:00:00.010000000 eight
8 1900-01-01 10:00:00.100000000 nine
9 1900-01-01 10:00:00.000000001 ten
10 1900-01-01 10:00:00.000000002 eleven
11 1900-01-01 10:00:00.000000003 twelve
Formatting a DataFrame column with "%H:%M:%S.%f" adds zeros after the datum with the least significant non-zero digit after decimal point (over the whole column and zeros are added according to the position:zeros table above) and aligns the resolution of all other data with that even if doing so increases or decreases the resolution of some raw data:
df4 = pd.read_csv('test2.csv')
df4
Out[36]:
date mesg
0 10:00:00.000000000 one
1 10:00:00.00000000 two
2 10:00:00.0000000 three
3 10:00:00.000000 four
4 10:00:00.00000 five
5 10:00:00.0001 six
6 10:00:00.00 seven
7 10:00:00.0 eight
8 10:00:00. nine
9 10:00:00.000000000 ten
10 10:00:00.000000000 eleven
11 10:00:00.00000000 twelve
df4['date'] = pd.to_datetime(df4['date'],format="%H:%M:%S.%f",coerce=True)
df4
Out[37]:
date mesg
0 1900-01-01 10:00:00.000000 one
1 1900-01-01 10:00:00.000000 two
2 1900-01-01 10:00:00.000000 three
3 1900-01-01 10:00:00.000000 four
4 1900-01-01 10:00:00.000000 five
5 1900-01-01 10:00:00.000100 six
6 1900-01-01 10:00:00.000000 seven
7 1900-01-01 10:00:00.000000 eight
8 NaT nine # nothing after decimal point in raw data
9 1900-01-01 10:00:00.000000 ten
10 1900-01-01 10:00:00.000000 eleven
11 1900-01-01 10:00:00.000000 twelve
When attempting this with the same DataFrame with but with dates included in the date column, the same thing happened:
df25
Out[38]:
date mesg
0 2015-09-10 10:00:00.000000000 one
1 2015-09-11 10:00:00.00000000 two
2 2015-09-12 10:00:00.0000000 three
3 2015-09-13 10:00:00.000000 four
4 2015-09-14 10:00:00.00000 five
5 2015-09-15 10:00:00.0001 six
6 2015-09-16 10:00:00.00 seven
7 2015-09-17 10:00:00.0 eight
8 2015-09-18 10:00:00. nine
9 2015-09-19 10:00:00.000000000 ten
10 2015-09-20 10:00:00.000000000 eleven
11 2015-09-21 10:00:00.00000000 twelve
df25['date'] = pd.to_datetime(df25['date'],format="%Y-%m-%d %H:%M:%S.%f",coerce=True)
df25
Out[39]:
date mesg
0 2015-09-10 10:00:00.000000 one
1 2015-09-11 10:00:00.000000 two
2 2015-09-12 10:00:00.000000 three
3 2015-09-13 10:00:00.000000 four
4 2015-09-14 10:00:00.000000 five
5 2015-09-15 10:00:00.000100 six
6 2015-09-16 10:00:00.000000 seven
7 2015-09-17 10:00:00.000000 eight
8 NaT nine # nothing after decimal point in raw data
9 2015-09-19 10:00:00.000000 ten
10 2015-09-20 10:00:00.000000 eleven
11 2015-09-21 10:00:00.000000 twelve
When no raw datum has a non-zero significant digit after the decimal point, formatting with a DataFrame column "%H:%M:%S.%f" may uniformly provide just two zeros after the decimal point for all the data even when that increases or decreases the resolution of some raw data:
df5 = pd.read_csv('test3.csv')
df5
Out[40]:
date mesg
0 10:00:00.000 one
1 10:00:00.0 two
2 10:00:00.000 three
3 10:00:00.000 four
4 10:00:00.00 five
5 10:00:00.000 six
6 10:00:00.00 seven
7 10:00:00.0 eight
8 10:00:00.0 nine
9 10:00:00.000000000 ten
10 10:00:00.000 eleven
11 10:00:00.000 twelve
df5['date'] = pd.to_datetime(df5['date'],format="%H:%M:%S.%f",coerce=True)
df5
Out[41]:
date mesg
0 1900-01-01 10:00:00 one
1 1900-01-01 10:00:00 two
2 1900-01-01 10:00:00 three
3 1900-01-01 10:00:00 four
4 1900-01-01 10:00:00 five
5 1900-01-01 10:00:00 six
6 1900-01-01 10:00:00 seven
7 1900-01-01 10:00:00 eight
8 1900-01-01 10:00:00 nine
9 1900-01-01 10:00:00 ten
10 1900-01-01 10:00:00 eleven
11 1900-01-01 10:00:00 twelve
The same thing happened when doing this test with the same DataFrame but with dates included in the date column:
df45
Out[42]:
date mesg
0 2015-09-10 10:00:00.000 one
1 2015-09-11 10:00:00.0 two
2 2015-09-12 10:00:00.000 three
3 2015-09-13 10:00:00.000 four
4 2015-09-14 10:00:00.00 five
5 2015-09-15 10:00:00.000 six
6 2015-09-16 10:00:00.00 seven
7 2015-09-17 10:00:00.0 eight
8 2015-09-18 10:00:00.0 nine
9 2015-09-19 10:00:00.000000000 ten
10 2015-09-20 10:00:00.000 eleven
11 2015-09-21 10:00:00.000 twelve
df45['date'] = pd.to_datetime(df45['date'],format="%Y-%m-%d %H:%M: %S.%f",coerce=True)
df45
Out[43]:
date mesg
0 2015-09-10 10:00:00 one
1 2015-09-11 10:00:00 two
2 2015-09-12 10:00:00 three
3 2015-09-13 10:00:00 four
4 2015-09-14 10:00:00 five
5 2015-09-15 10:00:00 six
6 2015-09-16 10:00:00 seven
7 2015-09-17 10:00:00 eight
8 2015-09-18 10:00:00 nine
9 2015-09-19 10:00:00 ten
10 2015-09-20 10:00:00 eleven
11 2015-09-21 10:00:00 twelve
Upvotes: 3
Reputation: 813
Sorry not enough rep to comment, so I shall just attempt my answer here. Completely agree with EdChum, it is a display issue. If you try:
pd.to_datetime ("10:00:00.00001",format="%H:%M:%S.%f")
The response should be:
Timestamp('1900-01-01 10:00:00.000010')
Upvotes: 0