Reputation: 4677
Given a file that looks like this:
1440927 1
1727557 3
1440927 2
9917156 4
The first field is an ID in range(0, 200000000). The second field is a type, in range(1, 5). Types 1 and 2 belong to a common category S1, while types 3 and 4 belong to S2. A single ID may have several records with different types. The file is about 200MB in size.
The problem is to count the number of IDs that have a record of type 1 or 2, and the number of IDs that have a record of type 3 or 4.
My code:
import bitarray

def gen(path):
    for line in open(path):
        tmp = line.split()
        yield int(tmp[0]), int(tmp[1])

max_id = 200000000
S1 = bitarray.bitarray(max_id)
S2 = bitarray.bitarray(max_id)
S1.setall(False)  # new bitarrays are not zero-initialized
S2.setall(False)
for id, type in gen(path):
    if type != 3 and type != 4:
        S1[id] = True
    else:
        S2[id] = True
print S1.count(), S2.count()
Although it gives the answer, I think it runs a little slowly. What should I do to make it run faster?
EDIT:
There are duplicated records in the file, and I only need to distinguish between S1 (types 1 and 2) and S2 (types 3 and 4). For example, 1440927 1 and 1440927 2 are counted only once, not twice, because they both belong to S1. So I have to store the IDs.
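For clarity, a minimal sketch of the counting I want, using plain sets (illustrative only; it reuses the gen() from above):
s1_ids, s2_ids = set(), set()
for id, type in gen(path):
    if type in (1, 2):
        s1_ids.add(id)
    else:
        s2_ids.add(id)
print len(s1_ids), len(s2_ids)  # 1440927 contributes once to the first count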
Upvotes: 7
Views: 209
Reputation: 414585
If there is enough memory, you could use a dict instead of bitarray.bitarray. It could be faster:
S1, S2 = {}, {}  # dicts are slightly faster than `set()`
with open(path) as f:
    for i, line in enumerate(f, 1):
        id, sep, type = line.partition(" ")
        type = type.strip()  # drop the trailing newline so the comparisons match
        if type == "1" or type == "2":
            S1[id] = True
        elif type == "3" or type == "4":
            S2[id] = True
        else:
            print "WARNING: unknown type: %r in line %d: %r" % (type, i, line)
print len(S1), len(S2)
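For comparison, the same loop with sets (a sketch; whether dict or set wins can vary between Python versions, so time both on your data):
S1, S2 = set(), set()
with open(path) as f:
    for line in f:
        id, sep, type = line.partition(" ")
        type = type.strip()
        if type in ("1", "2"):
            S1.add(id)
        elif type in ("3", "4"):
            S2.add(id)
print len(S1), len(S2)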
Or you could try to sort the lines first:
import itertools

def gettype(line):
    return line[-1]

S1, S2 = 0, 0
with open(path) as f:
    lines = f.read().splitlines()  # splitlines() strips the newlines
lines.sort(key=gettype)
for type, group in itertools.groupby(lines, gettype):
    ids = (line.partition(" ")[0] for line in group)
    if type == "1" or type == "2":
        S1 += len(set(ids))
    elif type == "3" or type == "4":
        S2 += len(set(ids))
    else:
        assert 0, (type, list(ids))
print S1, S2
The asymptotic complexity of the second approach is worse: O(n log n) for the sort versus O(n) for the single dict pass.
You could use line_profiler to find out where your bottleneck is.
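For example, decorate the counting function and run the script under kernprof (a sketch; count_ids.py is a hypothetical script name, and the entry-point name can vary between line_profiler versions):
@profile  # injected by kernprof at run time; no import needed
def count(path):
    S1, S2 = set(), set()
    with open(path) as f:
        for line in f:
            id, sep, type = line.partition(" ")
            (S1 if type.strip() in ("1", "2") else S2).add(id)
    return len(S1), len(S2)
Then run kernprof -l -v count_ids.py from the shell to print per-line timings.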
Upvotes: 2
Reputation: 107666
You're using an iterator over the file, which means you only buffer a few lines at a time. Every time the buffer is empty, the disk needs to seek and your program has to wait.
200MB easily fits in your memory, so reading all the lines first will speed things up:
def gen(path):
    # load all the lines at once
    lines = open(path).readlines()
    split = (line.split() for line in lines)
    return ((int(x), int(y)) for x, y in split)
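Usage is unchanged; for example, a set-based count on top of the rewritten gen() (a sketch; "data.txt" is a placeholder file name):
s1, s2 = set(), set()
for id, type in gen("data.txt"):
    (s1 if type in (1, 2) else s2).add(id)
print len(s1), len(s2)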
Upvotes: 3
Reputation: 213005
Are you tied to Python?
egrep -e "[12]$" filename.txt | cut -d " " -f 1 | sort -u | wc -l
egrep -e "[34]$" filename.txt | cut -d " " -f 1 | sort -u | wc -l
These two commands count the number of occurrences of ("1" or "2") and ("3" or "4") at the end of each line in filename.txt, while ignoring duplicate first fields.
Probably faster than Python…
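If GNU awk is available, the same counts can be done in a single pass without sorting (a sketch, relying on gawk's length(array)):
awk '$2 <= 2 { s1[$1] = 1 } $2 >= 3 { s2[$1] = 1 } END { print length(s1), length(s2) }' filename.txt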
Upvotes: 2