Abdul Haq

Reputation: 51

Count the frequency of words from a column in Python using external csv file

This question was asked before by user907629, and Maria Zverina answered it, but her answer didn't read the data from an external csv file.

My file contains more than 800000 records, and I want to read it from an external csv file. What changes should be made to that frequency count code?

Upvotes: 2

Views: 3367

Answers (3)

Padraic Cunningham

Reputation: 180542

You can do it without storing any intermediary lists:

import csv
from collections import Counter
from operator import itemgetter

with open('yourcsv', newline='') as f:
    reader = csv.reader(f)  # pass delimiter=... if your data is not comma-separated
    next(reader)  # skip the header
    # map is lazy in Python 3, so rows are counted as they are read
    cn = Counter(map(itemgetter(2), reader))

for t in cn.items():
    print("{} appears {} times".format(*t))

There is no reason to store the data in a list unless you plan to use the list; itemgetter(2) pulls just the third column value from each row. Pass the index of whatever column you want to count, and set the delimiter to whatever separates your data.
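As a sketch of the same idea in Python 3, with the column index and delimiter exposed as parameters. The function name count_column and its arguments are illustrative assumptions, not part of the original answer:

```python
import csv
from collections import Counter
from operator import itemgetter


def count_column(path, column=2, delimiter=",", has_header=True):
    """Count how often each value occurs in one column of a CSV file.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=delimiter)
        if has_header:
            next(reader, None)  # skip the header row, if there is one
        # map() is lazy, so no intermediate list of column values is built
        return Counter(map(itemgetter(column), reader))
```

Calling count_column('yourcsv').most_common(10) would then list the ten most frequent values in the third column.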

Upvotes: 4

Jos Polfliet

Reputation: 141

If you only need to do this once and you are on a UNIX machine, you can use the excellent command-line tools as well. With one word per line in the input, counting how often each one occurs is as simple as

cat "inputfile.txt" | sort | uniq -c

To store those values in an output file use

cat "inputfile.txt" | sort | uniq -c > outputfile.txt

See http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html for a discussion of when the command line can be (up to 235x) faster and easier than a Hadoop cluster.

Upvotes: -1

Ali Nikneshan

Reputation: 3502

  1. Use open to read the file from disk instead of StringIO.
  2. 800,000 rows is not so big that you need to worry about memory, so you can read the file as in the original question. For a much bigger file, read it row by row instead.

Check the new code:

import csv
from collections import Counter

# the with block closes the file once all rows have been read
with open('external.csv', newline='') as input_stream:
    reader = csv.reader(input_stream, delimiter='\t')
    next(reader)  # skip header
    cities = [row[2] for row in reader]

for k, v in Counter(cities).items():
    print("%s appears %d times" % (k, v))
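
For the bigger-file case in point 2, a row-by-row variant might look like the following sketch. It assumes Python 3 and the same tab-delimited layout with the value in the third column. The function name count_rows_streaming is hypothetical, and skipping rows that are too short to have that column is an assumption about how malformed lines should be treated:

```python
import csv
from collections import Counter


def count_rows_streaming(path, column=2, delimiter="\t"):
    """Count values in one column without building a list of all values."""
    counts = Counter()
    with open(path, newline="") as input_stream:
        reader = csv.reader(input_stream, delimiter=delimiter)
        next(reader, None)  # skip header
        for row in reader:
            if len(row) > column:  # assumption: ignore short or malformed rows
                counts[row[column]] += 1
    return counts
```

Memory use then grows with the number of distinct values rather than with the number of rows.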

Upvotes: 1
