Kun

Reputation: 601

read data from a huge CSV file efficiently

I was trying to process my huge CSV file (more than 20G), but the process was killed while reading the whole file into memory. To avoid this issue, I am trying to read only the second column, line by line.

For example, the 2nd column contains data like

  1. xxx, computer is good
  2. xxx, build algorithm

    import collections
    
    wordcount = collections.Counter()
    
    with open('desc.csv', 'rb') as infile:
        for line in infile:
            wordcount.update(line.split())
    

My code works on whole lines; how can I read only the second column, without using the CSV reader?

Upvotes: 1

Views: 1438

Answers (2)

Mike Robins

Reputation: 1773

It looks like the code in your question reads the 20G file, splits each line into space-separated tokens, and then updates a counter that keeps a count of every unique token. I'd say that is where your memory is going.

From the manual, csv.reader returns an iterator:

a reader object which will iterate over lines in the given csvfile. csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called

so it is fine to iterate through a huge file using csv.reader.

import collections
import csv

wordcount = collections.Counter()

with open('desc.csv', 'rb') as infile:
    for row in csv.reader(infile):
        # count words in strings from second column
        wordcount.update(row[1].split())
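
Because `csv.reader` accepts any iterable of strings, the same loop can be demonstrated on a small in-memory sample (hypothetical stand-in data for `desc.csv`); `Counter.most_common` then shows the top words:

```python
import collections
import csv

# In-memory stand-in for lines of the huge desc.csv (made-up sample data).
lines = [
    "1,computer is good\n",
    "2,build algorithm is good\n",
]

wordcount = collections.Counter()
for row in csv.reader(lines):
    # count words in strings from the second column, as in the answer
    wordcount.update(row[1].split())

print(wordcount.most_common(2))  # [('is', 2), ('good', 2)]
```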

Upvotes: 1

bjornruffians

Reputation: 701

If you'd rather not use the CSV reader at all, you can just read line-by-line and parse manually:

X = []

with open('desc.csv', 'r') as infile:
    for line in infile:
        # Split on comma first
        cols = [x.strip() for x in line.split(',')]

        # Grab 2nd "column"
        col2 = cols[1]

        # Split on whitespace (avoids empty strings from repeated spaces)
        words = col2.split()
        for word in words:
            if word not in X:
                X.append(word)

for w in X:
    print(w)

That will keep a smaller chunk of the file in memory at a given time (one line). However, you may still run into memory limits as X grows, since every unique word is kept; it depends on how many unique words are in your "vocabulary" list.
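
If the vocabulary does get large, one common variant (a sketch, using hypothetical sample lines in place of `desc.csv`) is to collect the words in a set, where the membership test is constant time instead of a scan of the whole list:

```python
# Collect unique words from the 2nd column with a set:
# `word in some_set` is O(1), vs a linear scan for `word in X` on a list.
unique_words = set()

# In-memory stand-in for lines of desc.csv (made-up sample data).
lines = [
    "1, computer is good\n",
    "2, build algorithm\n",
]

for line in lines:
    cols = [x.strip() for x in line.split(',')]
    unique_words.update(cols[1].split())

for w in sorted(unique_words):
    print(w)
```

Note that a set does not preserve insertion order, so the words are printed sorted here.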

Upvotes: 1
