Reputation: 715
I currently have a CSV file with 200k rows, each row containing 80 entries separated by commas. I try to open the CSV file with open() and append the data to a 2-D Python list. When I then iterate through that list and concatenate the 80 entries of each row into a single string, the computer freezes. Does my code produce some kind of memory issue? Should I work with my data in batches, or is there a more efficient way to get through what I'm trying to do?
In short: open the CSV, go through the 200k rows and transform each of them from [1, 2, 3, 4, 5, ..., 80] to [12345...80], i.e. one concatenated string per row, 200k times.
import csv

# create empty shells
raw_data = []
concatenate_data = []

def get_data():
    counter = 1
    # open the raw data file and put it into a list
    with open('raw_data_train.csv', 'r') as file:
        reader = csv.reader(file, dialect='excel')
        for row in reader:
            print('\rCurrent item: {0}'.format(counter), end='', flush=True)
            raw_data.append(row)
            counter += 1
    print('\nReading done')

def format_data():
    counter = 1
    temp = ''
    # concatenate the separated letters for each string in the csv file
    for batch in raw_data:
        for letters in batch:
            temp += letters
        concatenate_data.append(temp)
        print('\rCurrent item: {0}'.format(counter), end='', flush=True)
        counter += 1
    print('\nTransforming done')
    print(concatenate_data[0:10])
Upvotes: 3
Views: 310
Reputation: 140168
Your format_data() routine is bound to hog your CPU a lot:

- string concatenation, which is sub-optimal compared to other methods (StringIO, str.join)
- the temp variable is never reset in the whole routine
- temp is re-appended in the loop (appending basically a bigger and bigger string each time)

I suppose you just want to do this: append all the text as one string for each line, without spaces. That is much faster done with str.join, which avoids string concatenation:
for batch in raw_data:
    concatenate_data.append("".join(batch))
or even faster if you can get rid of the prints:
concatenate_data = ["".join(batch) for batch in raw_data]
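If memory is also a concern, the same str.join idea can be applied while the file is being read, so the intermediate raw_data list is never built. A minimal sketch along those lines (assuming the same file name as in the question):

import csv

# read the CSV and join each row into one string in a single pass,
# without keeping the raw 2-D list around
with open('raw_data_train.csv', 'r') as file:
    reader = csv.reader(file, dialect='excel')
    concatenate_data = ["".join(row) for row in reader]

print(concatenate_data[0:10])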
Upvotes: 1