kevbonham

Reputation: 1040

Iterate through csv by column

I have a bunch of large (~4 million values) csv files, and I need to take each column and create a file that organizes the values in a way that can be interpreted by a different program. The columns are quite different in length (between 1,000 and 2 million values), and each csv may have between 4 and 100 columns.

I can load the whole thing into a pandas.DataFrame and then iterate through the series, but it's pretty slow:

import pandas as pd
import re
import os
for f in os.listdir(folder):
    gc = pd.read_csv('{}/{}'.format(folder, f))
    strain = f[:-7] # files have regular name structure, this just gets the name

    with open('{}.txt'.format(strain), 'w+') as out_handle:
        for column in gc:
            series = gc[column]
            for i in range(0, len(series), 10):
                pos = i + 1
                gc_cont = series[i]
                if pd.isnull(gc_cont):
                    continue
                out_handle.write('{} {}'.format(pos, gc_cont))
                # I'm writing other info, but it's not important here

Maybe there's a large performance cost from filling the smaller columns with a million-plus NaN values and loading the whole thing into memory? In any case, I think it would be significantly more efficient to read column by column, but I can't find a way to do that.

Pandas can do chunk size (docs), but that's chunking rows. If I write row by row, I'd either have to have 4-100 files open at once, or iterate through the original file that many times to write each separate column. Is either of these approaches appropriate, or is there something I'm missing? A rough sketch of the row-chunking variant I have in mind is below.
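
For concreteness, this is roughly what I mean by keeping one output handle per column while chunking rows (the chunk size and output file names are just placeholders):

import pandas as pd

handles = {}  # one open output file per column
offset = 0    # row offset of the current chunk within the whole file
for chunk in pd.read_csv('{}/{}'.format(folder, f), chunksize=100000):
    for column in chunk.columns:
        if column not in handles:
            handles[column] = open('{}_{}.txt'.format(strain, column), 'w')
        series = chunk[column]
        for i in range(0, len(series), 10):  # chunksize is a multiple of 10, so the stride lines up across chunks
            if pd.isnull(series.iloc[i]):
                continue
            handles[column].write('{} {}\n'.format(offset + i + 1, series.iloc[i]))
    offset += len(chunk)
for handle in handles.values():
    handle.close()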

Upvotes: 1

Views: 853

Answers (3)

dermen

Reputation: 5362

What about the usecols option to read_csv? Also, you could use the squeeze option to return a pandas.Series, which might be faster if you are only working with single columns anyway. Something like:

import pandas

cols = ['col0', 'col1', 'col2'] # the columns you want to load
for col in cols:
    data = pandas.read_csv(..., usecols=[col], squeeze=True)
    # format column data etc.

Here are the docs

usecols : array-like

Return a subset of the columns. Results in much faster parsing time and lower memory usage.

squeeze : boolean, default False

If the parsed data only contains one column then return a Series
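
Applied to the loop in your question, that could look roughly like this (I'm assuming the first row of each csv is a header; the NaN check is still needed because shorter columns get padded when the file is parsed):

import pandas as pd

# read just the header row to discover the column names
cols = pd.read_csv('{}/{}'.format(folder, f), nrows=0).columns

with open('{}.txt'.format(strain), 'w+') as out_handle:
    for col in cols:
        # squeeze=True turns the single-column result into a Series
        series = pd.read_csv('{}/{}'.format(folder, f), usecols=[col], squeeze=True)
        for i in range(0, len(series), 10):
            if pd.isnull(series.iloc[i]):
                continue
            out_handle.write('{} {}\n'.format(i + 1, series.iloc[i]))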

Upvotes: 1

strubbly

Reputation: 3477

How about reading the whole file into a string and wrapping it in a StringIO (or BytesIO depending on Python 2/3)? Then use that as the csv file and iterate once per column.

Something like this:

import csv
import StringIO  # use io.StringIO / io.BytesIO on Python 3

with open('{}/{}'.format(folder, f)) as in_file:
    data = in_file.read()

for index in range(number_of_columns): # not sure how to derive this
    csv_file = csv.reader(StringIO.StringIO(data))
    for data_row in csv_file:
        # do what you want with data_row[index]
        pass

EDIT:

It seems this doesn't address the performance problems. In the light of your comments, I think the best approach for performance is going to be to open all 4 to 100 files at once and write through to them as you read. I don't think a modern OS will have any problem with that. This is algorithmically the simplest and as a bonus also minimises memory usage. The work it does in reading and parsing and writing is needed in any version. There is a possible risk of contention for the disk head I suppose but at a guess I don't think that's going to be a problem.

I think only testing will show if it works faster - it's not obvious.

So that would be something like

with open('{}/{}'.format(folder, f)) as in_file:
    csv_file = csv.reader(in_file)

    # open up the files needed and put them into file_list 

    for data_row in csv_file:
        for index, datum in enumerate(data_row):
            if datum != "":
                file_list[index].write(datum)

I haven't exactly emulated your write scheme but I'm sure you get what I mean. Obviously you'll need a mechanism to find the right number of files (maybe look at the first line?), and close them afterwards etc.
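
For instance, one way to set up file_list from the first line and close everything afterwards (the output file names here are just placeholders):

import csv

with open('{}/{}'.format(folder, f)) as in_file:
    csv_file = csv.reader(in_file)
    header = next(csv_file)  # the first line tells us how many columns there are

    # one output file per column
    file_list = [open('{}_{}.txt'.format(strain, name), 'w') for name in header]
    try:
        for data_row in csv_file:
            for index, datum in enumerate(data_row):
                if datum != "":
                    file_list[index].write(datum + '\n')
    finally:
        for out_file in file_list:
            out_file.close()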

Upvotes: 1

rgalbo

Reputation: 4465

The easiest way may just be to read the entire file into a pandas DataFrame and then write each column to its own file.

import pandas as pd
import os

for f in os.listdir(folder):
    gc = pd.read_csv('{}/{}'.format(folder, f))
    strain = f[:-7]

    for col in gc.columns:
        temp = gc[col]
        temp.to_csv('new_path' + strain + col)

This way, even though you have a memory-consuming operation, you are just splitting the larger frames up into columns and creating separate files that will be easier to work with.

Upvotes: 1
