Reputation: 3849
Problem presentation
I have the following file (file is sorted according to all three columns):
D000001 D000001 1975
D000001 D000001 1976
D000001 D002413 1976
D000001 D002413 1979
D000001 D002413 1987
D000001 D004298 1976
D000002 D000002 1985
D000003 D000900 1975
D000003 D000900 1990
D000003 D004134 1983
D000003 D004134 1986
I need to count duplicate pairs (in the 1st and 2nd columns) and assign to each such pair the lowest value from the 3rd column. For my toy file the output should be:
D000001 D000001 2 1975
D000001 D002413 3 1976
D000001 D004298 1 1976
D000002 D000002 1 1985
D000003 D000900 2 1975
D000003 D004134 2 1983
My questions
My initial attempt with the current output is as follows.
my_dict = {}
with open('test.srt') as infile:
    for line in infile:
        line = line.rstrip()
        word1, word2, year = line.split('|')
        year = int(year)
        my_tuple = (word1, word2)
        if my_tuple in my_dict:
            freq += 1
            my_dict[my_tuple] = (freq, year)
        else:
            freq = 1
            my_dict[my_tuple] = (freq, year)

for key, value in my_dict.items():
    print key[0], key[1], value
Current output:
D000001 D000001 (2, 1976) ## Should be 1975 etc.
D000001 D002413 (3, 1987)
D000001 D004298 (1, 1976)
D000002 D000002 (1, 1985)
D000003 D000900 (2, 1990)
D000003 D004134 (2, 1986)
Upvotes: 2
Views: 241
Reputation: 58617
TXR:
@(repeat)
@dleft @dright @num
@ (collect :gap 0 :vars (dupenum))
@dleft @dright @dupenum
@ (end)
@ (output)
@dleft @dright @(+ 1 (length dupenum)) @num
@ (end)
@(end)
Run:
$ txr data.txr data
D000001 D000001 2 1975
D000001 D002413 3 1976
D000001 D004298 1 1976
D000002 D000002 1 1985
D000003 D000900 2 1975
D000003 D004134 2 1983
AWK:
$1 != old1 || $2 != old2 { printf("%s", out);
                           count = 0
                           old1 = $1
                           old2 = $2
                           old3 = $3 }
{ out = $1 " " $2 " " ++count " " old3 "\n" }
END { printf("%s", out); }
$ awk -f data.awk data
D000001 D000001 2 1975
D000001 D002413 3 1976
D000001 D004298 1 1976
D000002 D000002 1 1985
D000003 D000900 2 1975
D000003 D004134 2 1983
TXR Lisp functional one-liner:
$ txr -t '[(opip (mapcar* (op split-str @1 " "))
                 (partition-by [callf list first second])
                 (mapcar* (aret `@[@1 0..2] @(+ 1 (length @rest)) @[@1 2]`)))
          (get-lines)]' < data
D000001 D000001 2 1975
D000001 D002413 3 1976
D000001 D004298 1 1976
D000002 D000002 1 1985
D000003 D000900 2 1975
D000003 D004134 2 1983
Upvotes: 2
Reputation: 3791
Since the file is large, you should not use an in-memory dictionary to manage the data. Read the source file line by line and write the results directly to the target file; all you really need is three variables: one to store the current tuple, a second to store the count, and a third to store the lowest year. When the tuple changes, write the values to the output file and continue (a minimal sketch follows below).
This approach has a very small memory footprint and can handle insanely large files as well. But of course it works only because your tuples are sorted.
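Here is one way that idea might look in Python; this is a rough sketch rather than the answerer's own code, and it assumes space-separated fields plus the test.srt / output.txt file names used elsewhere on this page:

current = None   # the (word1, word2) pair currently being counted
count = 0        # how many lines have shared that pair so far
min_year = None  # lowest year seen for the pair (the first line's year, since the file is sorted)

with open('test.srt') as infile, open('output.txt', 'w') as outfile:
    for line in infile:
        word1, word2, year = line.split()
        pair, year = (word1, word2), int(year)
        if pair != current:
            # the pair changed: flush the previous group and start a new one
            if current is not None:
                outfile.write('%s %s %d %d\n' % (current[0], current[1], count, min_year))
            current, count, min_year = pair, 1, year
        else:
            count += 1
            min_year = min(min_year, year)
    # flush the final group
    if current is not None:
        outfile.write('%s %s %d %d\n' % (current[0], current[1], count, min_year))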
Upvotes: 3
Reputation: 19050
Solution:
#!/usr/bin/env python
def readdata(filename):
    last = []
    count = 0
    with open(filename, "r") as fd:
        for line in fd:
            tokens = line.strip().split()
            tokens[2] = int(tokens[2])
            if not last:
                last = tokens
            if tokens[:2] != last[:2]:
                yield last[:2], count or 1, last[2]
                last = tokens
                count = 1
            else:
                count += 1
                tokens[2] = min(tokens[2], last[2])
        yield last[:2], count, last[2]

with open("output.txt", "w") as fd:
    for words, count, year in readdata("data.txt"):
        fd.write(
            "{0:s} {1:s} ({2:d} {3:d})\n".format(
                words[0], words[1], count, year
            )
        )
Output:
D000001 D000001 (2 1975)
D000001 D002413 (3 1976)
D000001 D004298 (1 1976)
D000002 D000002 (1 1985)
D000003 D000900 (2 1975)
D000003 D004134 (2 1983)
Discussion:
The algorithm in effect is quite similar to itertools.groupby (See the other answer that uses this but assumes Python 3.x).
It's probably worth noting that this implementation is also O(n) (Big O).
Upvotes: 2
Reputation: 810
Groupby and generators are the way to go:
import csv
from itertools import groupby

def count_duplicate(it):
    # group by the first two fields
    groups = groupby(it, lambda line: line[:2])
    # this will produce (key, group) pairs, where a group is an iterator
    # containing ['field0', 'field1', year] values where the field0 and field1
    # strings are the same respectively
    # the min_and_count function converts such a group into a (count, min) pair
    def min_and_count(group):
        i, min_year = 0, 99999
        for _, _, year in group:
            i += 1
            min_year = year if year < min_year else min_year
        return (i, min_year)

    yield from map(lambda x: x[0] + [min_and_count(x[1])], groups)

with open("test.srt") as fp:
    # this reads the lines in a lazy fashion and filters empty lines out
    lines = filter(bool, csv.reader(fp, delimiter=' '))
    # convert the last value to an integer (still in a lazy fashion)
    lines = map(lambda line: [line[0], line[1], int(line[2])], lines)
    # write the result to another file
    with open("result_file", "w") as rf:
        for record in count_duplicate(lines):
            rf.write(str(record) + '\n')
NB: This is a Python 3.x solution, where filter and map return iterators rather than lists as they do in Python 2.x.
Upvotes: 3