bsbob

Reputation: 3

In a list of tuples, return tuple[1] if tuple[0] is a duplicate of another tuple[0] in the list

I have a list consisting of tuples containing 2 strings.
The first one is a checksum and the second the name of the corresponding file.
Is there a fast way to search for duplicate checksums and return the corresponding filenames?

For example:

[("sumstring1","abc.txt"),("sumstring2","def.txt"),("sumstring1","ghi.txt"),("sumstring2","jkl.txt")]
-->
    [("abc.txt","ghi.txt"),("def.txt","jkl.txt")]

I've tried making two different lists, one with the checksums and one with the filenames. Then I used collections.Counter to find duplicate checksums. Using list.index() I got their indexes and the corresponding filenames out of the other list, like so:

import collections

csList=["sumstring1","sumstring2","sumstring1","sumstring2"]
fnList=["abc.txt","def.txt","ghi.txt","jkl.txt"]
indexList=[]

# keep only the checksums that occur more than once, with their counts
multiList=[(k,v) for k,v in collections.Counter(csList).items() if v>1]

(In this case multiList would be [("sumstring1",2),("sumstring2",2)])

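for elem in multiList:
    temp=()
    start=0
    for i in range(elem[1]):
        # search from just past the previous match instead of removing items,
        # which would shift the positions of everything that follows
        idx=csList.index(elem[0],start)
        temp+=(idx,)
        start=idx+1
    indexList.append(temp)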

This gave me a list with the indexes of the duplicate files ([(0,2),(1,3)]) which I could then use to find the filenames.

This works but is very ugly. Is there a simpler, more "Pythonic" way to do it?

Upvotes: 0

Views: 292

Answers (3)

asher

Reputation: 43

What you want is a dictionary. It gives you constant-time lookups on average and cannot hold the same key twice.

d = {}
d['sumstring1'] = 'abc.txt'
d.get('sumstring1')   # 'abc.txt'
d['sumstring1']       # 'abc.txt'

You can't add a key twice, so if you do:

d['sumstring1'] = 'def.txt'

You'll replace the old value with the new value.

To record more than one result, you can simply store a list as the value in the dictionary, like so:

d['sumstring2'] = ['ghi.txt', 'jkl.txt']
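
A minimal sketch of that idea applied to the list of pairs from the question, using a plain membership check. The names pairs and files_by_checksum are illustrative and not from the original post.

```python
pairs = [("sumstring1", "abc.txt"), ("sumstring2", "def.txt"),
         ("sumstring1", "ghi.txt"), ("sumstring2", "jkl.txt")]

files_by_checksum = {}
for checksum, filename in pairs:
    # Create the list the first time a checksum is seen, then append to it,
    # so a repeated checksum adds a filename instead of overwriting one.
    if checksum not in files_by_checksum:
        files_by_checksum[checksum] = []
    files_by_checksum[checksum].append(filename)

print(files_by_checksum["sumstring2"])  # ['def.txt', 'jkl.txt']
```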

Upvotes: 0

recursive

Reputation: 86124

You could use a dictionary.

tuples = [("sumstring1","abc.txt"),("sumstring2","def.txt"),("sumstring1","ghi.txt"),("sumstring2","jkl.txt")]

lookup = {}
for checksum, filename in tuples:
    lookup.setdefault(checksum, []).append(filename)

for checksum, filenames in lookup.items():
    if len(filenames) >= 2:
        print(checksum, filenames)

Output is

sumstring1 ['abc.txt', 'ghi.txt']
sumstring2 ['def.txt', 'jkl.txt']
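
If you need the list-of-tuples shape shown in the question, the same lookup can be wrapped in a small function. This is only a sketch: the name duplicate_filenames is made up for illustration, the extra "sumstring3" entry is added to show a non-duplicate being dropped, and the ordering of the result relies on dicts preserving insertion order (Python 3.7+).

```python
def duplicate_filenames(pairs):
    """Return one tuple of filenames per checksum that occurs more than once."""
    lookup = {}
    for checksum, filename in pairs:
        lookup.setdefault(checksum, []).append(filename)
    # Keep only the groups where at least two files share a checksum.
    return [tuple(names) for names in lookup.values() if len(names) >= 2]


pairs = [("sumstring1", "abc.txt"), ("sumstring2", "def.txt"),
         ("sumstring1", "ghi.txt"), ("sumstring2", "jkl.txt"),
         ("sumstring3", "unique.txt")]  # hypothetical non-duplicate entry
print(duplicate_filenames(pairs))
# [('abc.txt', 'ghi.txt'), ('def.txt', 'jkl.txt')]
```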

Upvotes: 0

abarnert

Reputation: 365925

You almost never want to use index on a list. Keep track of where you are while iterating; don't try to find your position again from the value.
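
As a sketch of that advice: enumerate hands you the position while you iterate, so the index tuples from the question can be collected in a single pass with no list.index() call. The name positions is illustrative; csList and indexList are borrowed from the question.

```python
from collections import defaultdict

csList = ["sumstring1", "sumstring2", "sumstring1", "sumstring2"]

positions = defaultdict(list)
for i, checksum in enumerate(csList):
    # i is the current position, so there is nothing to search for afterwards.
    positions[checksum].append(i)

# Keep only the checksums that were seen at more than one position.
indexList = [tuple(idxs) for idxs in positions.values() if len(idxs) > 1]
print(indexList)  # [(0, 2), (1, 3)]
```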

In this case, what you really want is a "multidict", a dictionary that maps keys to collections of values: here, checksums to collections of names. Any checksum that maps to a set of more than one name is a dup, and that set is exactly the collection of names you want to print out, so that's all it takes.

In Python, a multidict is usually represented as a dict whose values are either lists or sets. You could use tuples, as you were attempting to, and it will work, but conceptually tuples usually represent fixed-length, heterogeneous collections of values, where the index of a value tells you something about its meaning. What we have here is arbitrary-length, homogeneous collections of values, where the index is meaningless, and even the order is meaningless. That's a set, not a tuple. (If the order isn't meaningless, then it's either a list, or an OrderedSet.)

For example:

>>> pairs = [("sumstring1","abc.txt"), ("sumstring2","def.txt"),
...          ("sumstring1","ghi.txt"), ("sumstring2","jkl.txt")]
>>> import collections
>>> dups = collections.defaultdict(set)
>>> for checksum, name in pairs:
...     dups[checksum].add(name)
>>> dups
defaultdict(<class 'set'>, {'sumstring1': {'ghi.txt', 'abc.txt'}, 'sumstring2': {'def.txt', 'jkl.txt'}})

To eliminate any non-dups:

>>> dups = {checksum: names for checksum, names in dups.items() if len(names) > 1}
>>> dups
{'sumstring1': {'abc.txt', 'ghi.txt'}, 'sumstring2': {'def.txt', 'jkl.txt'}}

(Of course we didn't have any non-dups in your example, so this wasn't very exciting).

If you don't care about the checksums, and just want a list of sets:

>>> dups = list(dups.values())

And if you really want tuples instead of sets for some reason:

>>> dups = [tuple(names) for names in dups.values()]
>>> dups
[('ghi.txt', 'abc.txt'), ('def.txt', 'jkl.txt')]
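
If the order of the names does matter, the same approach works with lists in place of sets. A sketch, assuming first-seen order is what you want; dups_ordered and result are illustrative names.

```python
import collections

pairs = [("sumstring1", "abc.txt"), ("sumstring2", "def.txt"),
         ("sumstring1", "ghi.txt"), ("sumstring2", "jkl.txt")]

dups_ordered = collections.defaultdict(list)
for checksum, name in pairs:
    # Lists keep the names in the order they were encountered.
    dups_ordered[checksum].append(name)

result = [tuple(names) for names in dups_ordered.values() if len(names) > 1]
print(result)  # [('abc.txt', 'ghi.txt'), ('def.txt', 'jkl.txt')]
```

Unlike a set, a list will keep a name twice if the exact same pair appears twice in the input, so it would then be reported as a duplicate of itself.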

Upvotes: 1
