rgalbo

Reputation: 4465

faster csv loading with datetime index pandas

I am often iterating over financial price data stored in CSV files. I like the accessibility of using pandas datetime objects to subset and organize data, since all of my analysis is done in relation to time.

My issue is that when I am reading a file into my computer's memory, both the parse_dates argument supplied to pandas.read_csv() and pandas.to_datetime() are incredibly time-costly.

Does anyone know a better way of dealing with time stamps in python that can be optimized to run over many files that need to be loaded into memory?

%timeit pd.read_csv("C:\\Users\\rgalbo\\Desktop\\Equity\\Data\\sp - working data\\30min--Adj\\AAPL_24.csv",index_col=[0])

10 loops, best of 3: 112 ms per loop

with parse_dates = True

%timeit pd.read_csv("C:\\Users\\rgalbo\\Desktop\\Equity\\Data\\sp - working data\\30min--Adj\\AAPL_24.csv",index_col=[0],parse_dates=True)

1 loops, best of 3: 7.5 s per loop
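Most of that 7.5 s is pandas guessing the date format row by row. A common workaround is to load the index as strings and then convert it with an explicit format, which skips the guessing step entirely. A minimal sketch (the sample data and column names here are illustrative, not the real AAPL file):

```python
import io
import pandas as pd

# Stand-in for the real 30-minute bar CSV on disk.
csv = io.StringIO(
    "Date,Close\n"
    "01/02/2015,110.38\n"
    "01/05/2015,106.25\n"
)

# Read without parse_dates, so the index comes in as plain strings.
df = pd.read_csv(csv, index_col=0)

# Convert once with an explicit format -- no per-row format inference.
df.index = pd.to_datetime(df.index, format="%m/%d/%Y")
print(df.index.dtype)
```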

Upvotes: 0

Views: 3273

Answers (1)

Joshua Baboo

Reputation: 525

After testing a few options for loading and parsing a CSV file with 13,811,418 rows and 98 unique date values, we arrived at the snippet below, and found that if we pass the format param with a predefined date format ('%m/%d/%Y' in our case) we could reach 2.52 s with pandas 0.15.3.

def to_date(dates, lookup=False, **args):
    if lookup:
        # Parse each unique value only once, then map the results back
        # onto the full column -- a big win when dates repeat often.
        return dates.map({v: pd.to_datetime(v, **args) for v in dates.unique()})
    return pd.to_datetime(dates, **args)
  • also pass coerce=True (or errors='coerce' in later pandas versions) to enable date-format validation; otherwise invalid values are retained as string values and will lead to an error when any other datetime operation is performed on the DataFrame column
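To illustrate the lookup path, here is a small self-contained sketch (the sample date strings are made up; in practice `dates` would be a column read from the CSV):

```python
import pandas as pd

def to_date(dates, lookup=False, **args):
    if lookup:
        # Parse each unique value once, then map back to the full column.
        return dates.map({v: pd.to_datetime(v, **args) for v in dates.unique()})
    return pd.to_datetime(dates, **args)

# A column with many repeats but few unique date values,
# like the 13.8M-row / 98-unique-dates case above.
s = pd.Series(["01/02/2015"] * 3 + ["01/05/2015"] * 2)
parsed = to_date(s, lookup=True, format="%m/%d/%Y")
print(parsed.iloc[0])
```

The lookup dictionary only pays off when unique values are far fewer than rows; for mostly-unique timestamps, the plain `pd.to_datetime(dates, format=...)` branch is the better choice.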

Upvotes: 1
