tommylicious

Reputation: 329

Process Large (10gb) Time Series CSV file into daily files

I am new to Python 3, coming over from R.

I have a very large time series file (10GB) spanning 6 months. It is a CSV file where each row contains 6 fields: Date, Time, Data1, Data2, Data3, Data4. The "Data" fields are numeric. I would like to iterate through the file and write individual files, each containing only one day of data. The individual dates are known only by the fact that the date field suddenly changes: the data skip weekends, certain holidays, and random closures due to unforeseen events, so the vector of unique dates is not deterministic. The number of lines per day is also variable and unknown.

I envision reading each line into a buffer and comparing the date to the previous date.

If the next date equals the previous date, I append that line to the buffer. I repeat this until the next date != the previous date, at which point I write the buffer to a new CSV file containing only that day's data (00:00:00 to 23:59:59).

I had trouble appending the new lines with pandas DataFrames, and using readline to build a list just got too mangled for me. Looking for Pythonic advice.
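For concreteness, this is roughly the shape I was attempting (a rough sketch only; it assumes the date is the first comma-separated field, there is no header row, and 'test_in.csv' is a placeholder name):

    # sketch of the buffer-and-compare idea (assumptions as noted above)
    buffer = []
    prev_date = None
    with open('test_in.csv') as in_fp:
        for line in in_fp:
            date = line.split(',', 1)[0]
            if prev_date is not None and date != prev_date:
                # date changed: write out the previous day's buffer
                with open(prev_date.replace('/', '-') + '.csv', 'w') as out_fp:
                    out_fp.writelines(buffer)
                buffer = []
            buffer.append(line)
            prev_date = date
    if buffer:  # flush the final day
        with open(prev_date.replace('/', '-') + '.csv', 'w') as out_fp:
            out_fp.writelines(buffer)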

Upvotes: 1

Views: 852

Answers (2)

tommylicious

Reputation: 329

I was getting thrown off because the file object returned by open(...) is itself an iterator over lines. I was doing a separate readline(...) after the open(...), and so was unwittingly advancing the iterator and getting bad results.
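To illustrate, a minimal repro of what I was doing wrong ('test_in.csv' is a placeholder name):

    with open('test_in.csv') as in_fp:
        first = in_fp.readline()  # this consumes row 1...
        for line in in_fp:        # ...so this loop starts at row 2
            print(line, end='')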

There is a small problem with the csv writer which I'll post as a new question.

Upvotes: 0

tdelaney

Reputation: 77357

pandas isn't a good option here because it reads the entire CSV into memory. The standard csv module iterates line by line and will work better for you. It's pretty straightforward to write nested for loops that read each row and write it out, but you get extra points if you leverage iterators for shorter code.

itertools.groupby is interesting because it implements the check for a new date for you. Given an iterator and a key function, it yields (key, group) pairs where each group is an iterator that stops whenever the key (here, the date) changes. Those group iterators can be consumed directly by a csv writer.

import csv
import itertools

with open('test_in.csv', newline='') as in_fp:
    reader = csv.reader(in_fp)
    # groupby yields a fresh row iterator each time the date (field 0) changes
    for date, row_iter in itertools.groupby(reader, key=lambda row: row[0]):
        out_filename = date.replace('/', '-') + '.csv'  # todo: name your output file
        # newline='' lets the csv module manage line endings itself
        with open(out_filename, 'w', newline='') as out_fp:
            csv.writer(out_fp).writerows(row_iter)
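One note: if your file has a header row, skip it with next(reader) before the groupby loop; otherwise the header gets grouped under its own "date" and written to its own file.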

Upvotes: 2
