Reputation: 1040
I have a bunch of large (~4 million values) csv files, and I need to take each column and create a file that organizes the values in a way that can be interpreted by a different program. The columns vary quite a bit in length (between 1,000 and 2 million values), and each csv may have between 4 and 100 columns.
I can load the whole thing into a pandas.DataFrame
and then iterate through the series, but it's pretty slow:
import pandas as pd
import re
import os

for f in os.listdir(folder):
    gc = pd.read_csv('{}/{}'.format(folder, f))
    strain = f[:-7]  # files have a regular name structure, this just gets the name
    with open('{}.txt'.format(strain), 'w+') as out_handle:
        for column in gc:
            series = gc[column]
            for i in range(0, len(series), 10):
                pos = i + 1
                gc_cont = series[i]
                if pd.isnull(gc_cont):
                    continue
                out_handle.write('{} {}'.format(pos, gc_cont))
                # I'm writing other info, but it's not important here
Maybe there's a large performance cost from filling the smaller columns with a million-plus NaN values and loading the whole thing into memory? In any case, I think it would be significantly more efficient to read column by column, but I can't find a way to do that. Pandas can read in chunks (docs), but that's chunking rows (a quick sketch of what I mean is below). If I write row by row, I'd either have to keep 4-100 files open at once, or iterate through the original file that many times to write each separate column. Is either of these approaches appropriate, or is there something I'm missing?
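For reference, this is the kind of chunked read I mean; a minimal sketch, with the filename and chunk size as placeholders. Each chunk is a block of rows that still contains every column:

import pandas as pd

# chunksize yields DataFrames of rows, so every chunk still has all the columns
for chunk in pd.read_csv('some_file.csv', chunksize=100000):
    print(chunk.shape)  # (up to 100000 rows, number of columns)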
Upvotes: 1
Views: 853
Reputation: 5362
What about the usecols option to read_csv? Also, you can consider the squeeze option to return a pandas.Series, which might be faster if you are only working with single columns anyway. Something like
cols = ['col0', 'col1', 'col2']  # the columns you want to load
for col in cols:
    data = pandas.read_csv(..., usecols=[col], squeeze=True)
    # format column data etc.
Here are the docs:
usecols : array-like
Return a subset of the columns. Results in much faster parsing time and lower memory usage.
squeeze : boolean, default False
If the parsed data only contains one column then return a Series
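A fuller sketch of how that could fit your folder loop. The folder, strain, and write format are just taken from your question, and reading the header row first to discover the column names is my assumption, so treat the details as placeholders:

import os
import pandas as pd

for f in os.listdir(folder):
    path = '{}/{}'.format(folder, f)
    strain = f[:-7]
    # read only the header row to discover the column names
    cols = pd.read_csv(path, nrows=0).columns
    with open('{}.txt'.format(strain), 'w+') as out_handle:
        for col in cols:
            # load one column at a time; usecols keeps memory usage low
            # (squeeze=True returns a Series in older pandas; in pandas 2.x
            # use pd.read_csv(path, usecols=[col])[col] instead)
            series = pd.read_csv(path, usecols=[col], squeeze=True)
            for i in range(0, len(series), 10):
                if pd.isnull(series[i]):
                    continue
                out_handle.write('{} {}'.format(i + 1, series[i]))

Note that this parses the csv once per column, so it trades repeated parsing time for lower memory use.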
Upvotes: 1
Reputation: 3477
How about reading the whole file into a string and wrapping it in a StringIO (or BytesIO depending on Python 2/3)? Then use that as the csv file and iterate once per column.
Something like this:
import csv
import io  # io.StringIO on Python 3; StringIO.StringIO / BytesIO on Python 2

with open('{}/{}'.format(folder, f)) as in_file:
    data = in_file.read()
for index in range(number_of_columns):  # not sure how to derive this
    csv_file = csv.reader(io.StringIO(data))
    for data_row in csv_file:
        pass  # do what you want with data_row[index]
EDIT:
It seems this doesn't address the performance problems. In light of your comments, I think the best approach for performance is going to be to open all 4 to 100 files at once and write through to them as you read. I don't think a modern OS will have any problem with that. This is algorithmically the simplest approach and, as a bonus, also minimises memory usage. The work it does in reading and parsing and writing is needed in any version. There is a possible risk of contention for the disk head, I suppose, but at a guess I don't think that's going to be a problem.
I think only testing will show if it works faster - it's not obvious.
So that would be something like
import csv

with open('{}/{}'.format(folder, f)) as in_file:
    csv_file = csv.reader(in_file)
    # open up the files needed and put them into file_list
    for data_row in csv_file:
        for index, datum in enumerate(data_row):
            if datum != "":
                file_list[index].write(datum)
I haven't exactly emulated your write scheme but I'm sure you get what I mean. Obviously you'll need a mechanism to find the right number of files (maybe look at the first line?), and close them afterwards etc.
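A slightly more complete sketch of that idea, using contextlib.ExitStack to open and close the per-column files in one place. The output file naming and using the header row for the column names are just guesses, and folder / f are assumed from the question:

import csv
import contextlib

with open('{}/{}'.format(folder, f)) as in_file:
    csv_file = csv.reader(in_file)
    header = next(csv_file)  # first line tells us how many columns there are
    with contextlib.ExitStack() as stack:
        # one output file per column, all closed automatically on exit
        file_list = [stack.enter_context(open('{}_{}.txt'.format(f, name), 'w'))
                     for name in header]
        for row_number, data_row in enumerate(csv_file, start=1):
            for index, datum in enumerate(data_row):
                if datum != "":
                    file_list[index].write('{} {}\n'.format(row_number, datum))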
Upvotes: 1
Reputation: 4465
The easiest way may just be to read the entire file into a pandas DataFrame and then write each column to its own file.
import pandas as pd
import os

for f in os.listdir(folder):
    gc = pd.read_csv('{}/{}'.format(folder, f))
    strain = f[:-7]
    for col in gc.columns:
        temp = gc[col]
        temp.to_csv('new_path' + strain + col)
This way, even though it's a memory-consuming operation, you are just splitting the larger frames into columns and creating separate files that will be easier to work with.
Upvotes: 1