trev

Reputation: 679

How can I iterate over a data file without code duplication in Python?

I want to write a script to process some data files. The data files are just ASCII text with columns of data; here is a simple example...

The first column is an ID number, in this case from 1 to 3. The second column is a value of interest. (The actual files I'm using have many more IDs and values, but let's keep it simple here).

data.txt contents:

1 5
1 4
1 10
1 19
2 15
2 18
2 20
2 21
3 50
3 52
3 55
3 70

I want to iterate over the data and extract the values for each ID and process them, i.e. get all the values for ID 1 and do something with them, then get all the values for ID 2, and so on.

So I can write this in Python:

#!/usr/bin/env python

def processValues(values):
    print("Will do something with data here:", values)

with open('data.txt') as f:
    datalines = f.readlines()

currentID = 0
first = True

for line in datalines:
    fields = line.split()

    # if we've moved onto a new ID,
    # then process the values we've collected so far
    if fields[0] != currentID:

        # but if this is our first iteration, then
        # we just need to initialise our ID variable
        if not first:
            processValues(values)  # do something useful

        currentID = fields[0]
        values = []
        first = False

    values.append(fields[1])

processValues(values)  # do something with the last values

The problem I have is that processValues() must be called again after the loop. This requires code duplication, and means that I might one day write a script like this, forget the extra processValues() at the end, and silently miss the last ID. It also requires tracking whether we are on our 'first' iteration, which is annoying.

Is there any way to do this without having two calls to processValues() (one inside the loop for each new ID, and one after the loop for the last ID)?

The only way I can think of is to store the line number and check in the loop whether we're at the last line. But that seems to defeat the point of 'foreach'-style processing, where we work with the line itself rather than an index or the total number of lines. The same applies to other scripting languages like Perl, where it is common to iterate over lines with while(<FILE>) and have no idea how many lines remain. Is it always necessary to write the function call again at the end?

Upvotes: 0

Views: 204

Answers (2)

Tengis

Reputation: 2809

With numpy's loadtxt() it can go like this:

from numpy import loadtxt, unique

data = loadtxt("data.txt")
ids = unique(data[:, 0]).astype(int)

for id in ids:
    # d is a reduced matrix containing only the rows for <id>
    d = data[data[:, 0] == id]
    print('id=', id)
    print('d=')
    print(d)
    # ... do some stuff with d

For your example data this prints:

id= 1 
d=
[[  1.   5.]
 [  1.   4.]
 [  1.  10.]
 [  1.  19.]]
id= 2 
d=
[[  2.  15.]
 [  2.  18.]
 [  2.  20.]
 [  2.  21.]]
id= 3 
d=
[[  3.  50.]
 [  3.  52.]
 [  3.  55.]
 [  3.  70.]]

Upvotes: 1

Jon Clements

Reputation: 142206

You want to look at itertools.groupby if all occurrences of a key are contiguous - a basic example...

from itertools import groupby
from operator import itemgetter

with open('somefile.txt') as fin:
    lines = (line.split() for line in fin)
    for key, values in groupby(lines, itemgetter(0)):
        print('Key', key, 'has values')
        for value in values:
            print(value)
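
Run against the question's data.txt (assuming the filename is changed to match), each value is a split line, so the output starts like:

Key 1 has values
['1', '5']
['1', '4']
['1', '10']
['1', '19']

...and similarly for IDs 2 and 3.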

Alternatively, you can also look at using a collections.defaultdict with list as the default factory.
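
A minimal sketch of that alternative, assuming the same two-column data.txt layout as in the question (unlike groupby, this does not require the keys to be contiguous, since everything is collected into memory first):

from collections import defaultdict

groups = defaultdict(list)  # missing keys start as empty lists

with open('data.txt') as fin:
    for line in fin:
        fields = line.split()
        # accumulate every value under its ID
        groups[fields[0]].append(fields[1])

for key, values in groups.items():
    print('Key', key, 'has values', values)

The trade-off versus groupby is memory: groupby streams one group at a time, while the defaultdict holds all of the file's values at once.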

Upvotes: 3
