Andrej

Reputation: 3849

Count duplicate pairs in file with constraint

Problem presentation

I have the following file (the file is sorted by all three columns):

D000001 D000001 1975
D000001 D000001 1976
D000001 D002413 1976
D000001 D002413 1979
D000001 D002413 1987
D000001 D004298 1976
D000002 D000002 1985
D000003 D000900 1975
D000003 D000900 1990
D000003 D004134 1983
D000003 D004134 1986

I need to count duplicate pairs (in the 1st and 2nd columns) and assign to each such pair the lowest value from the 3rd column. For my toy file the output should be:

D000001 D000001 2 1975
D000001 D002413 3 1976
D000001 D004298 1 1976
D000002 D000002 1 1985
D000003 D000900 2 1975
D000003 D004134 2 1983

My questions

  1. The file is huge (1 GB to 5 GB). What is the most appropriate programming structure to use in this setting?
  2. How do I properly print the last (3rd) column? With the current code (see below), the program prints the highest value instead of the lowest.

My initial attempt with the current output is as follows.

my_dict = {}

with open('test.srt') as infile:
  for line in infile:
    line = line.rstrip()
    word1, word2, year = line.split()
    year = int(year)
    my_tuple = (word1, word2)
    if my_tuple in my_dict:
      freq += 1
      my_dict[my_tuple] = (freq, year)
    else:
      freq = 1
      my_dict[my_tuple] = (freq, year)

for key, value in my_dict.items():
  print key[0], key[1], value

Current output:

D000001 D000001 (2, 1976) ## Should be (2, 1975) etc.
D000001 D002413 (3, 1987)
D000001 D004298 (1, 1976)
D000002 D000002 (1, 1985)
D000003 D000900 (2, 1990)
D000003 D004134 (2, 1986)

Upvotes: 2

Views: 241

Answers (4)

Kaz

Reputation: 58617

TXR:

@(repeat)
@dleft @dright @num
@  (collect :gap 0 :vars (dupenum))
@dleft @dright @dupenum
@  (end)
@  (output)
@dleft @dright @(+ 1 (length dupenum)) @num
@  (end)
@(end)

Run:

$ txr data.txr data
D000001 D000001 2 1975
D000001 D002413 3 1976
D000001 D004298 1 1976
D000002 D000002 1 1985
D000003 D000900 2 1975
D000003 D004134 2 1983

AWK:

$1 != old1 || $2 != old2 { printf("%s", out);
                           count = 0
                           old1 = $1
                           old2 = $2
                           old3 = $3 }

                         { out = $1 " " $2 " " ++count " " old3 "\n" }

END                      { printf("%s", out); }

$ awk -f data.awk data
D000001 D000001 2 1975
D000001 D002413 3 1976
D000001 D004298 1 1976
D000002 D000002 1 1985
D000003 D000900 2 1975
D000003 D004134 2 1983

TXR Lisp functional one-liner:

$ txr -t '[(opip (mapcar* (op split-str @1 " "))
                 (partition-by [callf list first second])
                 (mapcar* (aret `@[@1 0..2] @(+ 1 (length @rest)) @[@1 2]`)))
           (get-lines)]' < data
D000001 D000001 2 1975
D000001 D002413 3 1976
D000001 D004298 1 1976
D000002 D000002 1 1985
D000003 D000900 2 1975
D000003 D004134 2 1983

Upvotes: 2

remudada

Reputation: 3791

Since the file is large, you should not use an in-memory dictionary to manage the data. Read the source file sequentially and write the results directly to the target file; all you really need is three variables:

one to store the current pair, a second to store the count, and a third to store the lowest value seen so far. When the pair changes, write the accumulated values to the output file and continue.

This approach has a very small memory footprint and can handle insanely large files as well. But of course it only works because your pairs are sorted.
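
A minimal sketch of this streaming approach (assuming the whitespace-separated layout and the test.srt name from the question; result.txt is just a placeholder output name):

current = None    # (word1, word2) pair of the group currently being read
count = 0         # number of lines seen for the current pair
min_year = None   # lowest year seen for the current pair

with open('test.srt') as infile, open('result.txt', 'w') as outfile:
    for line in infile:
        parts = line.split()
        if not parts:
            continue  # skip blank lines, if any
        word1, word2, year = parts[0], parts[1], int(parts[2])
        pair = (word1, word2)
        if pair != current:
            # the pair changed: write out the finished group and reset
            if current is not None:
                outfile.write('%s %s %d %d\n' % (current[0], current[1], count, min_year))
            current, count, min_year = pair, 0, year
        count += 1
        min_year = min(min_year, year)

    # write the final group, if the file was not empty
    if current is not None:
        outfile.write('%s %s %d %d\n' % (current[0], current[1], count, min_year))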

Upvotes: 3

James Mills

Reputation: 19050

Solution:

#!/usr/bin/env python


def readdata(filename):
    last = []   # tokens of the first line of the current group
    count = 0   # number of lines seen for the current group

    with open(filename, "r") as fd:
        for line in fd:
            tokens = line.strip().split()
            tokens[2] = int(tokens[2])

            if not last:
                last = tokens

            if tokens[:2] != last[:2]:
                # the pair changed: emit the finished group and start a new one
                yield last[:2], count or 1, last[2]
                last = tokens
                count = 1
            else:
                count += 1

            # remember the lowest year seen for the current group
            last[2] = min(tokens[2], last[2])

        # emit the final group (skipped for an empty file)
        if last:
            yield last[:2], count, last[2]


with open("output.txt", "w") as fd:
    for words, count, year in readdata("data.txt"):
        fd.write(
            "{0:s} {1:s} ({2:d} {3:d})\n".format(
                words[0], words[1], count, year
            )
        )

Output:

D000001 D000001 (2 1975)
D000001 D002413 (3 1976)
D000001 D004298 (1 1976)
D000002 D000002 (1 1985)
D000003 D000900 (2 1975)
D000003 D004134 (2 1983)

Discussion:

  • This reads and processes the data iteratively (Python 2.x), so it doesn't read everything into memory, which allows very large data files to be processed.
  • Complex data structures are also not required as long as the input data is sorted; we only need to keep track of the last set of tokens and of the minimum year per set of "duplicates".

The algorithm is in effect quite similar to itertools.groupby (see the other answer that uses it, but assumes Python 3.x).

It's probably worth noting that this implementation is also O(n) (Big O).
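
For illustration only (a sketch, not part of the original answer), roughly the same readdata generator can be written with itertools.groupby, assuming the same whitespace-separated data.txt:

from itertools import groupby

def readdata_groupby(filename):
    with open(filename, "r") as fd:
        # lazily split non-empty lines into ['word1', 'word2', 'year'] rows
        rows = (line.split() for line in fd if line.strip())
        for key, group in groupby(rows, key=lambda row: row[:2]):
            years = [int(row[2]) for row in group]
            # number of duplicates and the lowest year for this pair
            yield key, len(years), min(years)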

Upvotes: 2

anonymous_user_13

Reputation: 810

groupby and generators are the way to go:

import csv
from itertools import groupby

def count_duplicate(it):
    # group by the first two fields
    groups = groupby(it, lambda line: line[:2])
    # this produces (key, group) pairs, where each group is an iterator
    # over ['field0', 'field1', year] rows that share the same field0 and
    # field1 strings
    # the min_and_count function converts such a group into a (count, min_year) pair
    def min_and_count(group):
        i, min_year = 0, 99999
        for _, _, year in group:
            i += 1
            min_year = year if year < min_year else min_year
        return (i, min_year)

    yield from map(lambda x: x[0] + [min_and_count(x[1])], groups)


with open("test.srt") as fp:
    # this reads the lines in a lazy fashion and filters empty lines out
    lines = filter(bool, csv.reader(fp, delimiter=' '))
    # convert the last value to integer (still in a lazy fashion)
    lines = map(lambda line: [line[0], line[1], int(line[2])], lines)
    # write result to another file
    with open("result_file", "w") as rf:
        for record in count_duplicate(lines):
            rf.write(str(record) + '\n')

NB: This is a Python 3.x solution, where filter and map return iterators rather than lists as they do in Python 2.x.
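
For example (a quick illustration of that difference, not part of the solution above):

>>> m = map(int, ["1", "2", "3"])
>>> m                  # Python 3: a lazy map object, not a list
<map object at 0x...>
>>> list(m)
[1, 2, 3]

Under Python 2, the same map call would immediately return the list [1, 2, 3].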

Upvotes: 3
