Reputation: 4465
I am often iterating over financial price data stored in CSV files. I like the accessibility of using pandas datetime objects to subset and organize data, since all of my analysis is done in relation to time.
My issue is that when I am reading a file into my computer's memory, both the parse_dates argument supplied to pandas.read_csv() and pandas.to_datetime() are incredibly time costly.
Does anyone know a better way of dealing with timestamps in Python that can be optimized to run over many files that need to be loaded into memory?
%timeit pd.read_csv("C:\\Users\\rgalbo\\Desktop\\Equity\\Data\\sp - working data\\30min--Adj\\AAPL_24.csv",index_col=[0])
10 loops, best of 3: 112 ms per loop
With parse_dates=True:
%timeit pd.read_csv("C:\\Users\\rgalbo\\Desktop\\Equity\\Data\\sp - working data\\30min--Adj\\AAPL_24.csv",index_col=[0],parse_dates=True)
1 loops, best of 3: 7.5 s per loop
Upvotes: 0
Views: 3273
Reputation: 525
After testing a few options for loading and parsing a CSV file with 13,811,418 rows and 98 unique date values, we arrived at the snippet below, and found that if we pass the format
param with a predefined date format ('%m/%d/%Y' in our case) we could reach 2.52 s
with pandas 0.15.3.
import pandas as pd

def to_date(dates, lookup=False, **args):
    # With lookup=True, parse each unique date string only once and map the
    # results back onto the full column (fast when dates repeat many times).
    if lookup:
        return dates.map({v: pd.to_datetime(v, **args) for v in dates.unique()})
    return pd.to_datetime(dates, **args)
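As a minimal usage sketch (the file name and the 'Date' column are just placeholders, not from the original data), you would call it with the fixed format like this:

df = pd.read_csv("AAPL_24.csv")
# parse only the unique date strings, then reuse them for every row
df["Date"] = to_date(df["Date"], lookup=True, format="%m/%d/%Y")
df = df.set_index("Date")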
Upvotes: 1