Reputation: 2573
I've got a csv file with multiple columns one of which is the date, the next one is the time and then there is one with the miliseconds like so:
DATE TIME MSEC
0 13/01/2017 08:49:37 805102
1 13/01/2017 08:49:09 675839
2 13/01/2017 08:39:03 50614
3 13/01/2017 08:38:30 383081
Now I'm really impressed that I can use pandas to combine the date and the time when reading the csv file like
pd.read_csv(r"~/file.csv", parse_dates=[["DATE","TIME"]])
but I can't seem to squeeze in the milliseconds as well. As the file is quite large I'd be greatful if this could be done in read_csv as well rather then in a consecutive step.
Upvotes: 1
Views: 1484
Reputation: 210882
UPDATE: parsing date from different columns on the fly when reading CSV:
In [181]: pd.read_csv(fn,
date_parser=lambda d,t,ms: d + ' ' + t + '.' + ms,
parse_dates={'Timestamp':['DATE','TIME','MSEC']})
Out[181]:
Timestamp
0 2017-01-13 08:49:37.805102
1 2017-01-13 08:49:09.675839
2 2017-01-13 08:39:03.506140
3 2017-01-13 08:38:30.383081
OLD answer:
First read your CSV as it is:
df = pd.read_csv(r"~/file.csv")
In [170]: df
Out[170]:
DATE TIME MSEC
0 13/01/2017 08:49:37 805102
1 13/01/2017 08:49:09 675839
2 13/01/2017 08:39:03 50614
3 13/01/2017 08:38:30 383081
In [171]: df.dtypes
Out[171]:
DATE object
TIME object
MSEC int64
dtype: object
now we can convert it:
In [172]: df['TimeStamp'] = pd.to_datetime(df.DATE + ' ' + df.TIME + '.' + df.MSEC.astype(str), format='%d/%m/%Y %H:%M:%S.%f')
In [173]: df
Out[173]:
DATE TIME MSEC TimeStamp
0 13/01/2017 08:49:37 805102 2017-01-13 08:49:37.805102
1 13/01/2017 08:49:09 675839 2017-01-13 08:49:09.675839
2 13/01/2017 08:39:03 50614 2017-01-13 08:39:03.506140
3 13/01/2017 08:38:30 383081 2017-01-13 08:38:30.383081
In [174]: df.dtypes
Out[174]:
DATE object
TIME object
MSEC int64
TimeStamp datetime64[ns]
dtype: object
TIMING:
In [186]: df = pd.concat([df] * 10**3, ignore_index=True)
In [187]: df.shape
Out[187]: (4000, 3)
In [188]: df.to_csv(fn, index=False)
In [189]: pd.options.display.max_rows = 6
In [190]: df
Out[190]:
DATE TIME MSEC
0 13/01/2017 08:49:37 805102
1 13/01/2017 08:49:09 675839
2 13/01/2017 08:39:03 50614
... ... ... ...
3997 13/01/2017 08:49:09 675839
3998 13/01/2017 08:39:03 50614
3999 13/01/2017 08:38:30 383081
[4000 rows x 3 columns]
In [191]: %%timeit
...: pd.read_csv(fn,
...: date_parser=lambda d,t,ms: d + ' ' + t + '.' + ms,
...: parse_dates={'Timestamp':['DATE','TIME','MSEC']})
...:
1 loop, best of 3: 3.31 s per loop
In [192]: %%timeit
...: df = pd.read_csv(fn)
...: df['TimeStamp'] = pd.to_datetime(df.pop('DATE') + ' ' +
...: df.pop('TIME') + '.' +
...: df.pop('MSEC').astype(str),
...: format='%d/%m/%Y %H:%M:%S.%f')
...:
10 loops, best of 3: 122 ms per loop
Conclusion: reading CSV as it is and parsing date from the DataFrame was 27 times faster for 4.000 rows data set.
Upvotes: 2