Reputation: 601
I was trying to process my huge CSV file (more than 20 GB), but the process was killed while reading the whole file into memory. To avoid this, I want to read the second column line by line.
For example, the 2nd column contains data like
xxx, build algorithm
import collections

wordcount = collections.Counter()
with open('desc.csv', 'rb') as infile:
    for line in infile:
        wordcount.update(line.split())
My code works across all the columns; how can I count words from only the second column, without using the CSV reader?
Upvotes: 1
Views: 1438
Reputation: 1773
It looks like the code in your question reads the 20 GB file, splits each line into space-separated tokens, and updates a counter with every token. I'd say that counter is where your memory is going.
From the manual, csv.reader returns an iterator:
a reader object which will iterate over lines in the given csvfile. csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called
so it is fine to iterate through a huge file using csv.reader.
import collections
import csv

wordcount = collections.Counter()
with open('desc.csv', 'rb') as infile:
    for row in csv.reader(infile):
        # count words in strings from the second column
        wordcount.update(row[1].split())
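To see the whole streaming pattern end to end, here is a minimal, self-contained sketch of the same approach, assuming Python 3 and using a small in-memory sample in place of your desc.csv (the sample rows are hypothetical):

```python
import collections
import csv
import io

# Hypothetical stand-in for desc.csv.
sample = "1,build algorithm\n2,build data structure\n3,test algorithm\n"

wordcount = collections.Counter()
# csv.reader consumes its source one row at a time, so even a 20 GB
# file is never held in memory all at once.
for row in csv.reader(io.StringIO(sample)):
    wordcount.update(row[1].split())

print(wordcount.most_common(2))
```

The Counter itself is the only structure that grows, and it grows with the number of unique words, not with the file size.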
Upvotes: 1
Reputation: 701
For the record, csv.reader(infile) does not read the whole file up front; it iterates lazily. But since you asked how to do it without the CSV reader, you can just read line-by-line and parse manually (note this naive split will break if a field itself contains a comma):
X = []
with open('desc.csv', 'r') as infile:
    for line in infile:
        # Split on comma first
        cols = [x.strip() for x in line.split(',')]
        # Grab the 2nd "column"
        col2 = cols[1]
        # Split on spaces
        words = [x.strip() for x in col2.split(' ')]
        for word in words:
            if word not in X:
                X.append(word)

for w in X:
    print(w)
That will keep only a small chunk of the file in memory at a time (one line). However, the variable X may still grow quite large, to the point where the program errors out due to memory limits; it depends on how many unique words are in your "vocabulary" list.
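One cheap improvement on the sketch above: `word not in X` against a list is a linear scan on every word, while a set gives average constant-time membership tests. A minimal variant of the same loop using a set, with hypothetical in-line sample lines standing in for desc.csv:

```python
# Hypothetical sample lines standing in for desc.csv.
lines = [
    "1,build algorithm",
    "2,build data structure",
]

vocab = set()  # set membership is O(1) on average, vs O(n) for a list
for line in lines:
    col2 = line.split(',')[1]
    vocab.update(word.strip() for word in col2.split(' '))

print(sorted(vocab))
```

Unlike the list version, a set does not preserve the order in which words were first seen; if that order matters, a dict used as an ordered set (insertion-ordered since Python 3.7) works instead.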
Upvotes: 1