Reputation: 2649
Given a tab delimited file, How to count total number of elements in each column? My file is ~6GB in size.
column count min max sum mean
80 29573061 2 40 855179253 28.92
81 28861459 2 40 802912711 27.82
82 40 778234605 27.63
83 27479902 2 40 27.44
84 26800815 40 729443846 27.22
85 26127825 2 701704155 26.86
Output:
`column` has 6 items in it
`count` has 5 items in it
and so on
Upvotes: 0
Views: 3183
Reputation: 2636
Map reduce library is made for such big data processing. This is a good idea to use it. https://developers.google.com/appengine/docs/python/dataprocessing/overview#Downloading_the_Mapreduce_Library
Upvotes: 0
Reputation: 4343
Assuming you mean count the occurrences of non-whitespace text, something like this:
import collections
titles = None
counts = collections.defaultdict(int)
with open("file.txt") as fd:
for line in fd:
items = line.split("\t")
if titles is None:
if items:
titles = [i.strip() for i in items]
continue
for i in xrange(min(len(titles), len(items))):
if items[i].strip():
counts[titles[i]] += 1
for column, count in counts.iteritems():
print "Column %s has %d items" % (column, count)
Note that this code relies on no row having more elements than the title row, although it does handle shorter rows. I'm sure there are more concise ways using list comprehensions and the like, but I thought that a more verbose style might be clearer.
If you meant summing the values themselves or similar, you'll need to do something a little more clever with the items
list. Purely by way of example, to find the total sum of the "count" column:
total_count = 0
with open("file.txt") as fd:
for line in fd
items = line.split("\t")
try:
total_count += int(items[1])
except ValueError:
pass
I'm not quite sure what you're trying to do, but hopefully this examples give you an idea of how to go about this sort of processing in a simple way.
Upvotes: 0
Reputation: 353059
How about something like:
import csv
from collections import Counter
counts = Counter()
with open("count.tsv", "rb") as fp:
reader = csv.DictReader(fp, delimiter="\t")
for row in reader:
counts += Counter(k for k,v in row.items() if v.strip())
which gives
>>> counts
Counter({'column': 6, 'mean': 6, 'count': 5, 'max': 5, 'sum': 5, 'min': 4})
Although, as noted by @Cartroo, you might have to change the emptiness criterion depending on exactly what values you want to consider empty.
Upvotes: 2
Reputation: 66
I would do something like this:
For the column counts, you could just use another list. For example counts is our list, counts[0] will be the number of elements of the column "column", counts[1] will be the number of elements of the column "count", etc.
Upvotes: 0