user1140126
user1140126

Reputation: 2649

How to count total number of elements in each column in a tab delimited file

Given a tab delimited file, How to count total number of elements in each column? My file is ~6GB in size.

column  count   min max sum mean
80  29573061    2   40  855179253   28.92
81  28861459    2   40  802912711   27.82
82                  40  778234605   27.63
83  27479902    2   40              27.44
84  26800815        40  729443846   27.22
85  26127825    2       701704155   26.86

Output:

`column` has  6 items in it
`count` has 5 items in it
 and so on

Upvotes: 0

Views: 3183

Answers (4)

singer
singer

Reputation: 2636

Map reduce library is made for such big data processing. This is a good idea to use it. https://developers.google.com/appengine/docs/python/dataprocessing/overview#Downloading_the_Mapreduce_Library

Upvotes: 0

Cartroo
Cartroo

Reputation: 4343

Assuming you mean count the occurrences of non-whitespace text, something like this:

import collections

titles = None
counts = collections.defaultdict(int)
with open("file.txt") as fd:
    for line in fd:
        items = line.split("\t")
        if titles is None:
            if items:
                titles = [i.strip() for i in items]
            continue
        for i in xrange(min(len(titles), len(items))):
            if items[i].strip():
                counts[titles[i]] += 1

for column, count in counts.iteritems():
    print "Column %s has %d items" % (column, count)

Note that this code relies on no row having more elements than the title row, although it does handle shorter rows. I'm sure there are more concise ways using list comprehensions and the like, but I thought that a more verbose style might be clearer.

If you meant summing the values themselves or similar, you'll need to do something a little more clever with the items list. Purely by way of example, to find the total sum of the "count" column:

total_count = 0
with open("file.txt") as fd:
    for line in fd
        items = line.split("\t")
        try:
            total_count += int(items[1])
        except ValueError:
            pass

I'm not quite sure what you're trying to do, but hopefully this examples give you an idea of how to go about this sort of processing in a simple way.

Upvotes: 0

DSM
DSM

Reputation: 353059

How about something like:

import csv
from collections import Counter

counts = Counter()
with open("count.tsv", "rb") as fp:
    reader = csv.DictReader(fp, delimiter="\t")
    for row in reader:
        counts += Counter(k for k,v in row.items() if v.strip())

which gives

>>> counts
Counter({'column': 6, 'mean': 6, 'count': 5, 'max': 5, 'sum': 5, 'min': 4})

Although, as noted by @Cartroo, you might have to change the emptiness criterion depending on exactly what values you want to consider empty.

Upvotes: 2

cmhley
cmhley

Reputation: 66

I would do something like this:

  1. Read the file line by line
  2. Split each line by a tab delimiter ('\t')
    • For each split, iterate through each element of the list
      • Check for an empty string value
        • If there is an empty string, don't do anything
        • If there is not an empty string, increment that column's count

For the column counts, you could just use another list. For example counts is our list, counts[0] will be the number of elements of the column "column", counts[1] will be the number of elements of the column "count", etc.

Upvotes: 0

Related Questions