Reputation: 715
I currently have a CSV file with 200k rows, each row containing 80 entries separated by commas. I try to open the CSV file with open() and append the data to a 2-D Python list. When I then iterate through that list and concatenate the 80 entries of each row into a single string, the computer freezes. Does my code produce some kind of memory issue? Should I work with my data in batches, or is there a more efficient way to get through what I'm trying to do?
In short: open the CSV, go through the 200k rows and transform each of them from [1, 2, 3, 4, 5, ..., 80] to [12345...80], i.e. one concatenated string per row, 200k times.
import csv

# create empty shells
raw_data = []
concatenate_data = []

def get_data():
    counter = 1
    # open the raw data file and put it into a list
    with open('raw_data_train.csv', 'r') as file:
        reader = csv.reader(file, dialect='excel')
        for row in reader:
            print('\rCurrent item: {0}'.format(counter), end='', flush=True)
            raw_data.append(row)
            counter += 1
    print('\nReading done')

def format_data():
    counter = 1
    temp = ''
    # concatenate the separated letters for each string in the csv file
    for batch in raw_data:
        for letters in batch:
            temp += letters
        concatenate_data.append(temp)
        print('\rCurrent item: {0}'.format(counter), end='', flush=True)
        counter += 1
    print('\nTransforming done')
    print(concatenate_data[0:10])
Upvotes: 3
Views: 310
Reputation: 140168
Your format_data() routine is bound to hog your CPU a lot:

- string concatenation, which is sub-optimal compared to other methods (StringIO, str.join)
- the temp variable is never reset in the whole routine
- temp is re-appended in the loop (appending basically a bigger and bigger string each time)

I suppose you just want to do this: append all the text as one string for each line, without spaces. That is much faster done with str.join, which avoids string concatenation:
for batch in raw_data:
    concatenate_data.append("".join(batch))
or even faster if you can get rid of the prints:
concatenate_data = ["".join(batch) for batch in raw_data]
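If memory is also a concern, the same str.join idea can be applied while the file is being read, so the intermediate raw_data list is never built. A minimal sketch along those lines (assuming the same file name as in the question):

import csv

# read the CSV and join each row into one string in a single pass,
# without keeping the raw 2-D list around
with open('raw_data_train.csv', 'r') as file:
    reader = csv.reader(file, dialect='excel')
    concatenate_data = ["".join(row) for row in reader]

print(concatenate_data[0:10])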
Upvotes: 1