Reputation: 329
I am new to Python 3, coming over from R.
I have a very large time series file (10 GB) which spans 6 months. It is a CSV file where each row contains 6 fields: Date, Time, Data1, Data2, Data3, Data4. The "Data" fields are numeric. I would like to iterate through the file and create and write individual files, each containing only one day of data. The individual dates are known only by the fact that the date field suddenly changes; i.e., the data don't include weekends, certain holidays, or random closures due to unforeseen events, so the vector of unique dates is not deterministic. The number of lines per day is also variable and unknown.
I envision reading each line into a buffer and comparing its date to the previous date.
If the next date equals the previous date, I append that line to the buffer. I repeat this until the next date differs from the previous date, at which point I write the buffer to a new CSV file containing only that day's data (00:00:00 to 23:59:59).
I had trouble appending the new lines with pandas DataFrames, and using readline() into a list just got too mangled for me. Looking for Pythonic advice.
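For reference, a minimal sketch of the buffer-and-flush loop I have in mind (the input filename and the assumption that Date is the first column are placeholders, not the real layout):

import csv

# Sketch of the buffering approach described above; 'big_series.csv' and the
# date being in column 0 are assumptions.
with open('big_series.csv', newline='') as in_fp:
    reader = csv.reader(in_fp)
    buffer = []
    current_date = None
    for row in reader:
        if current_date is not None and row[0] != current_date:
            # date changed: write out the previous day's rows
            with open(current_date.replace('/', '-') + '.csv', 'w', newline='') as out_fp:
                csv.writer(out_fp).writerows(buffer)
            buffer = []
        current_date = row[0]
        buffer.append(row)
    if buffer:
        # write the final day's rows
        with open(current_date.replace('/', '-') + '.csv', 'w', newline='') as out_fp:
            csv.writer(out_fp).writerows(buffer)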
Upvotes: 1
Views: 852
Reputation: 329
I was getting thrown off by the fact that the file object returned by open(...) is itself the line iterator. I was doing a separate readline(...) after the open(...) and so was unwittingly advancing the iterator and getting bad results.
There is a small problem with the csv write which I'll post as a new question.
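For illustration, a minimal sketch of the mistake (the filename is just a placeholder): the file object is a single shared iterator over lines, so a stray readline() consumes a row that the csv reader then never sees.

import csv

# The file object is one iterator over lines, so this readline()
# silently discards the first row before the reader ever starts.
with open('test_in.csv', newline='') as in_fp:
    _ = in_fp.readline()        # advances the file past line 1
    reader = csv.reader(in_fp)
    first_seen = next(reader)   # this is actually the file's second row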
Upvotes: 0
Reputation: 77357
pandas isn't a good option here because it reads the entire CSV into memory. The standard csv module iterates line by line and will work better for you. It's pretty straightforward to write nested for loops to read each row and write it out, but you get extra points if you leverage iterators for shorter code.
itertools.groupby is interesting because it implements the check for a new date for you. After being handed an iterator, it returns iterators that stop whenever a key like the date changes. Those iterators can be consumed by a csv writer.
import csv
import itertools

with open('test_in.csv', newline='') as in_fp:
    reader = csv.reader(in_fp)
    # groupby yields one (date, rows) pair per run of consecutive rows
    # that share the same first field
    for date, row_iter in itertools.groupby(reader, key=lambda row: row[0]):
        out_filename = date.replace('/', '-') + '.csv'  # todo: name your output file
        # newline='' avoids extra blank rows from the csv writer on Windows
        with open(out_filename, 'w', newline='') as out_fp:
            csv.writer(out_fp).writerows(row_iter)
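Note that itertools.groupby only groups consecutive rows that share the same key, which works here because the file is already in chronological order; if the same date could reappear later in the file you would need to sort first, or open the output files in append mode instead of overwriting them.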
Upvotes: 2