A. Fant
A. Fant

Reputation: 13

Algorithm to convert a list of paired equivalent values into lists of all equivalent values?

I am working on a program that needs to be able to remove duplicate values from a list of lists. However, I can only identify the duplicate values by a pairwise comparison. When I am done with the comparisons, I have a list of equivalent pairs. But I need a list of all the equivalent values to do further processing in order to determine which of the duplicate values get kept.

I've crufted together some code that seemed to work for a few elements, but it's not working under load when I try to use it for lists with a few thousand entries. The code I am using is:


    seen = []
    holding = []

    for dup_pair in all_dup_pairs:
        if dup_pair[0] not in seen and dup_pair[1] not in seen and dup_pair[0] not in holding:
            holding.append(dup_pair[0])
            holding.sort()
            seen.append(dup_pair[0])
            seen.append(dup_pair[1])
            seen.sort()
        if dup_pair[1] not in seen:
            seen.append(dup_pair[1])
            seen.sort()

    for item in holding:
        final_duplicates.append([item])

    for dup_pair in all_dup_pairs:
        for i in range(len(final_duplicates)):
            if dup_pair[0] in final_duplicates[i] and dup_pair[1] not in final_duplicates[i]:
                final_duplicates[i].append(dup_pair[1])

(yes, I know that it's inefficient and ugly)

So, for example, if the original elements were [a,c,a,a,b,b,d,e,b,c], I would start with dup_pairs being [ [0,2], [0,3], [1,9], [2,3], [4,5], [4,8], [5,8] ] and I would like to end up with final_duplicates being [ [0,2,3], [1,9] [4,5,8] ]. As I said, the code works on small examples like this, but it fails on the much larger versions of the list I need for production, and instead of just trying to fix the code, I'd like to try to do it "right" so that I can actually work with it again in 18 months when the problem comes up again. Thanks to anyone who has any suggestions for the right algorithm.

Upvotes: 1

Views: 113

Answers (2)

Andrei Odegov
Andrei Odegov

Reputation: 3429

Check out the following:

def gum(l):
    g = {}
    for i, k in enumerate(l):
        g.setdefault(k, []).append(i)
    return g

x = 'acaabbdebc'
print(gum(x))

Output:

{'b': [4, 5, 8], 'a': [0, 2, 3], 'e': [7], 'd': [6], 'c': [1, 9]}

Upvotes: 0

B.Gees
B.Gees

Reputation: 1165

you can do:

import re
x = ["a","c","a","a","b","b","d","e","b","c"]
s = ''.join(x)
[(v, [m.start() for m in re.finditer(v, s)]) for v in set(x)]

and the result is:

[('c', [1, 9]), ('d', [6]), ('e', [7]), ('b', [4, 5, 8]), ('a', [0, 2, 3])]

Upvotes: 2

Related Questions