Magellan88

Reputation: 2573

Pandas parse dates from three fields with milliseconds

I've got a CSV file with multiple columns, one of which is the date, the next is the time, and then there is one with the milliseconds, like so:

    DATE        TIME        MSEC 
0   13/01/2017  08:49:37    805102    
1   13/01/2017  08:49:09    675839    
2   13/01/2017  08:39:03    50614    
3   13/01/2017  08:38:30    383081

Now I'm really impressed that I can use pandas to combine the date and the time when reading the CSV file, like

pd.read_csv(r"~/file.csv", parse_dates=[["DATE","TIME"]])

but I can't seem to squeeze in the milliseconds as well. As the file is quite large, I'd be grateful if this could be done in read_csv as well, rather than in a subsequent step.

Upvotes: 1

Views: 1484

Answers (1)

MaxU - stand with Ukraine

Reputation: 210882

UPDATE: parsing the date from several columns on the fly while reading the CSV:

In [181]: pd.read_csv(fn, 
                      date_parser=lambda d,t,ms: d + ' ' + t + '.' + ms, 
                      parse_dates={'Timestamp':['DATE','TIME','MSEC']})
Out[181]:
                   Timestamp
0 2017-01-13 08:49:37.805102
1 2017-01-13 08:49:09.675839
2 2017-01-13 08:39:03.506140
3 2017-01-13 08:38:30.383081
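
For intuition, the date_parser callable receives the values of the listed columns and must return something pd.to_datetime can handle; here the lambda simply builds one combined string per row. Below is a toy illustration of that single step, not part of the answer itself (note that newer pandas versions, 2.0 and later, deprecate date_parser in favor of date_format):

    import pandas as pd

    # What the lambda produces for one row of the sample data.
    d, t, ms = '13/01/2017', '08:49:37', '805102'
    combined = d + ' ' + t + '.' + ms               # '13/01/2017 08:49:37.805102'
    print(pd.to_datetime(combined, dayfirst=True))  # 2017-01-13 08:49:37.805102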

OLD answer:

First read your CSV as it is:

df = pd.read_csv(r"~/file.csv")

In [170]: df
Out[170]:
         DATE      TIME    MSEC
0  13/01/2017  08:49:37  805102
1  13/01/2017  08:49:09  675839
2  13/01/2017  08:39:03   50614
3  13/01/2017  08:38:30  383081

In [171]: df.dtypes
Out[171]:
DATE    object
TIME    object
MSEC     int64
dtype: object

Now we can convert it:

In [172]: df['TimeStamp'] = pd.to_datetime(df.DATE + ' ' + df.TIME + '.' + df.MSEC.astype(str), format='%d/%m/%Y %H:%M:%S.%f')

In [173]: df
Out[173]:
         DATE      TIME    MSEC                  TimeStamp
0  13/01/2017  08:49:37  805102 2017-01-13 08:49:37.805102
1  13/01/2017  08:49:09  675839 2017-01-13 08:49:09.675839
2  13/01/2017  08:39:03   50614 2017-01-13 08:39:03.506140
3  13/01/2017  08:38:30  383081 2017-01-13 08:38:30.383081


In [174]: df.dtypes
Out[174]:
DATE                 object
TIME                 object
MSEC                  int64
TimeStamp    datetime64[ns]
dtype: object
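
An alternative sketch (not part of the original answer), using the df read above: parse DATE and TIME only, then add MSEC as a timedelta. Note the interpretation difference: pd.to_timedelta treats the third row's 50614 as 50,614 microseconds (.050614 s), whereas the string concatenation above appends the digits after the decimal point and yields .506140 s; which reading is correct depends on how the file was written (assumed here: MSEC is a microsecond count).

    # Hedged alternative: add MSEC as an explicit microsecond offset.
    df['TimeStamp'] = (pd.to_datetime(df.DATE + ' ' + df.TIME,
                                      format='%d/%m/%Y %H:%M:%S')
                       + pd.to_timedelta(df.MSEC, unit='us'))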

TIMING:

In [186]: df = pd.concat([df] * 10**3, ignore_index=True)

In [187]: df.shape
Out[187]: (4000, 3)

In [188]: df.to_csv(fn, index=False)

In [189]: pd.options.display.max_rows = 6

In [190]: df
Out[190]:
            DATE      TIME    MSEC
0     13/01/2017  08:49:37  805102
1     13/01/2017  08:49:09  675839
2     13/01/2017  08:39:03   50614
...          ...       ...     ...
3997  13/01/2017  08:49:09  675839
3998  13/01/2017  08:39:03   50614
3999  13/01/2017  08:38:30  383081

[4000 rows x 3 columns]

In [191]: %%timeit
     ...: pd.read_csv(fn,
     ...:             date_parser=lambda d,t,ms: d + ' ' + t + '.' + ms,
     ...:             parse_dates={'Timestamp':['DATE','TIME','MSEC']})
     ...:
1 loop, best of 3: 3.31 s per loop

In [192]: %%timeit
     ...: df = pd.read_csv(fn)
     ...: df['TimeStamp'] = pd.to_datetime(df.pop('DATE') + ' ' +
     ...:                                  df.pop('TIME') + '.' +
     ...:                                  df.pop('MSEC').astype(str),
     ...:                                  format='%d/%m/%Y %H:%M:%S.%f')
     ...:
10 loops, best of 3: 122 ms per loop

Conclusion: reading the CSV as-is and parsing the date from the DataFrame afterwards was about 27 times faster on a 4,000-row data set.
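
If this is needed in more than one place, the faster variant can be wrapped in a small helper (a hypothetical convenience function, not part of the original answer):

    import pandas as pd

    def read_with_timestamp(path):
        # Read the CSV as plain columns, then build one datetime column
        # from DATE, TIME and MSEC, dropping the three source columns.
        df = pd.read_csv(path)
        df['TimeStamp'] = pd.to_datetime(df.pop('DATE') + ' ' +
                                         df.pop('TIME') + '.' +
                                         df.pop('MSEC').astype(str),
                                         format='%d/%m/%Y %H:%M:%S.%f')
        return df

Called as read_with_timestamp(r"~/file.csv"), it returns a DataFrame with a TimeStamp column in place of the original DATE, TIME and MSEC columns.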

Upvotes: 2
