Reputation: 679
I want to write a script to process some data files. The data files are just ASCII text with columns of data; here is a simple example...
The first column is an ID number, in this case from 1 to 3. The second column is a value of interest. (The actual files I'm using have many more IDs and values, but let's keep it simple here).
data.txt contents:
1 5
1 4
1 10
1 19
2 15
2 18
2 20
2 21
3 50
3 52
3 55
3 70
I want to iterate over the data and extract the values for each ID and process them, i.e. get all the values for ID 1 and do something with them, then get all the values for ID 2, and so on.
So I can write something like this in Python:
#!/usr/bin/env python

def processValues(values):
    print "Will do something with data here: ", values

f = open('data.txt', 'r')
datalines = f.readlines()
f.close()

currentID = 0
first = True
for line in datalines:
    fields = line.split()
    # if we've moved onto a new ID,
    # then process the values we've collected so far
    if (fields[0] != currentID):
        # but if this is our first iteration, then
        # we just need to initialise our ID variable
        if (not first):
            processValues(values)  # do something useful
        currentID = fields[0]
        values = []
        first = False
    values.append(fields[1])
processValues(values)  # do something with the last values
The problem I have is that processValues()
must be called again at the end. So this requires code duplication, and means that I might one day write a script like this and forget to put the extra processValues()
at the end, and therefore miss the last ID. It also requires storing whether it is our 'first' iteration, which is annoying.
Is there any way to do this without having two function calls to processValues()
(one inside the loop for each new ID, one after the loop for the last ID)?
The only way I can think of is by storing the line number and checking in the loop whether we're at the last line. But that seems to defeat the point of the 'foreach' style of processing, where we work with the line itself and not the index or the total number of lines. This would also apply to other scripting languages like Perl, where it would be common to iterate over lines with while(<FILE>)
and not have any idea of the number of lines remaining. Is it always necessary to write the function call again at the end?
Upvotes: 0
Views: 204
Reputation: 2809
With loadtxt()
it may go like this:
from numpy import loadtxt, unique

data = loadtxt("data.txt")
ids = unique(data[:,0]).astype(int)
for id in ids:
    d = data[ data[:,0] == id ]
    # d is a reduced matrix containing only the rows for <id>
    # .......
    # do some stuff with d
For your example, printing id and d in the loop will give:
id= 1
d=
[[ 1. 5.]
[ 1. 4.]
[ 1. 10.]
[ 1. 19.]]
id= 2
d=
[[ 2. 15.]
[ 2. 18.]
[ 2. 20.]
[ 2. 21.]]
id= 3
d=
[[ 3. 50.]
[ 3. 52.]
[ 3. 55.]
[ 3. 70.]]
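If you only need the values column of each group, it can be sliced straight out of d. A small usage sketch, reusing data and ids from the snippet above (the assumption that the second column holds the values of interest comes from the question):

for id in ids:
    d = data[ data[:,0] == id ]
    values = d[:,1]                 # just the second column, as a 1-D array
    print "ID", id, "values:", values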
Upvotes: 1
Reputation: 142206
You want to look at itertools.groupby if all occurrences of a key are contiguous - a basic example...
from itertools import groupby
from operator import itemgetter

with open('somefile.txt') as fin:
    lines = ( line.split() for line in fin )
    for key, values in groupby(lines, itemgetter(0)):
        print 'Key', key, 'has values'
        for value in values:
            print value
Alternatively - you can also look at using a collections.defaultdict with a list
as the default.
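A minimal sketch of that approach (assuming the same data.txt layout as in the question) might look like:

from collections import defaultdict

groups = defaultdict(list)            # missing keys start out as an empty list
with open('data.txt') as fin:
    for line in fin:
        fields = line.split()
        groups[fields[0]].append(fields[1])

# every ID has already been collected, so there is no special case for the last group
for key, values in sorted(groups.items()):
    print 'Key', key, 'has values', values

Unlike groupby, this holds all the values in memory at once, but it does not require the occurrences of an ID to be contiguous in the file.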
Upvotes: 3