Reputation: 3
I have a list consisting of tuples containing 2 strings.
The first one is a checksum and the second the name of the corresponding file.
Is there a fast way to search for duplicate checksums and return the corresponding filenames?
For example:
[("sumstring1","abc.txt"),("sumstring2","def.txt"),("sumstring1","ghi.txt"),("sumstring2","jkl.txt")]
-->
[("abc.txt","ghi.txt"),("def.txt","jkl.txt")]
I've tried making two different lists, one with the checksums and one with the filenames.
Then I used collections.Counter to find duplicate checksums. Using list.index(), I got their indexes and the corresponding filenames out of the other list, like so:
import collections
csList=["sumstring1","sumstring2","sumstring1","sumstring2"]
fnList=["abc.txt","def.txt","ghi.txt","jkl.txt"]
indexList=[]
multiList=[(k,v) for k,v in collections.Counter(csList).items() if v>1]
(In this case multiList would be [("sumstring1",2),("sumstring2",2)].)
for elem in multiList:
    temp=()
    for i in range(elem[1]):
        idx=csList.index(elem[0])
        temp+=(idx,)
        # blank the entry instead of remove(), so the later indexes don't shift
        csList[idx]=None
    indexList.append(temp)
This gave me a list with the indexes of the duplicate files ([(0,2),(1,3)]) which I could then use to find the filenames.
This works but is very ugly. Is there a simpler, more "Pythonic" way to do it?
Upvotes: 0
Views: 292
Reputation: 43
What you want is a dictionary. This will give you constant-time lookups on average and will eliminate duplicate keys.
d = {}
d['sumstring1'] = 'abc.txt'
d.get('sumstring1')   # 'abc.txt'
d['sumstring1']       # 'abc.txt'
You can't add a key twice, so if you do:
d['sumstring1'] = 'def.txt'
You'll replace the old value with the new value.
To record more than one result, you can simply store a list in the dictionary, like so:
d['sumstring2'] = ['ghi.txt', 'jkl.txt']
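As a minimal sketch of the difference between the two approaches, using the pairs from the question (the names `last_only` and `grouped` are made up for illustration): plain assignment keeps only the last filename per checksum, while storing a list keeps all of them.

```python
pairs = [("sumstring1", "abc.txt"), ("sumstring2", "def.txt"),
         ("sumstring1", "ghi.txt"), ("sumstring2", "jkl.txt")]

# Plain assignment: each repeated checksum overwrites the earlier filename.
last_only = {}
for checksum, filename in pairs:
    last_only[checksum] = filename

# A list per checksum: every filename is kept.
grouped = {}
for checksum, filename in pairs:
    if checksum not in grouped:
        grouped[checksum] = []
    grouped[checksum].append(filename)

print(last_only)  # {'sumstring1': 'ghi.txt', 'sumstring2': 'jkl.txt'}
print(grouped)    # {'sumstring1': ['abc.txt', 'ghi.txt'], 'sumstring2': ['def.txt', 'jkl.txt']}
```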
Upvotes: 0
Reputation: 86124
You could use a dictionary.
tuples = [("sumstring1","abc.txt"),("sumstring2","def.txt"),("sumstring1","ghi.txt"),("sumstring2","jkl.txt")]
lookup = {}
for checksum, filename in tuples:
    lookup.setdefault(checksum, []).append(filename)
for checksum, filenames in lookup.items():
    if len(filenames) >= 2:
        print(checksum, filenames)
Output is
sumstring1 ['abc.txt', 'ghi.txt']
sumstring2 ['def.txt', 'jkl.txt']
Upvotes: 0
Reputation: 365925
You almost never want to use index on a list. Keep track of where you are while iterating; don't try to find your position again from the value.
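If you did still want the index tuples from the question, one way to get them without index() is to record positions with enumerate as you go. This is only a sketch; `cs_list`, `positions`, and `index_list` are illustrative names, and it assumes Python 3.7+ so that dict order follows first appearance.

```python
from collections import defaultdict

cs_list = ["sumstring1", "sumstring2", "sumstring1", "sumstring2"]

# Map each checksum to the positions where it occurs, in a single pass.
positions = defaultdict(list)
for i, checksum in enumerate(cs_list):
    positions[checksum].append(i)

# Keep only the checksums that occur more than once.
index_list = [tuple(idxs) for idxs in positions.values() if len(idxs) > 1]
print(index_list)  # [(0, 2), (1, 3)]
```

Unlike repeated index() calls, this never rescans the list and never has to modify it.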
In this case, what you really want is a "multidict", a dictionary that maps keys to collections of values: here, checksums to collections of names. Any checksum that maps to a set of more than one name is a dup, and that set is exactly the group of names you want to print out, so that's all it takes.
In Python, a multidict is usually represented as a dict whose values are either lists or sets. You could use tuples, as you were attempting to, and it will work, but conceptually tuples usually represent fixed-length, heterogeneous collections of values, where the index of a value tells you something about its meaning. What we have here is arbitrary-length, homogeneous collections of values, where the index is meaningless, and even the order is meaningless. That's a set, not a tuple. (If the order isn't meaningless, then it's either a list, or an OrderedSet.)
For example:
>>> import collections
>>> pairs = [("sumstring1","abc.txt"), ("sumstring2","def.txt"),
...          ("sumstring1","ghi.txt"), ("sumstring2","jkl.txt")]
>>> dups = collections.defaultdict(set)
>>> for checksum, name in pairs:
...     dups[checksum].add(name)
...
>>> dups
defaultdict(<class 'set'>, {'sumstring1': {'ghi.txt', 'abc.txt'}, 'sumstring2': {'def.txt', 'jkl.txt'}})
To eliminate any non-dups:
>>> dups = {checksum: names for checksum, names in dups.items() if len(names) > 1}
>>> dups
{'sumstring1': {'abc.txt', 'ghi.txt'}, 'sumstring2': {'def.txt', 'jkl.txt'}}
(Of course we didn't have any non-dups in your example, so this wasn't very exciting).
If you don't care about the checksums, and just want a list of sets:
>>> dups = list(dups.values())
And if you really want tuples instead of sets for some reason:
>>> dups = [tuple(names) for names in dups.values()]
>>> dups
[('ghi.txt', 'abc.txt'), ('def.txt', 'jkl.txt')]
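Note that sets have no defined order, so the names inside each tuple above can come out either way round. If you want the exact output from the question, with names in input order, a list-valued version is one option. The following is a rough sketch, not the only way to do it; the function name `find_duplicate_files` and the extra `sumstring3` entry are made up for illustration.

```python
from collections import defaultdict

def find_duplicate_files(pairs):
    """Return one tuple of filenames for each checksum that occurs more than once."""
    names_by_checksum = defaultdict(list)
    for checksum, name in pairs:
        names_by_checksum[checksum].append(name)
    # Lists keep input order; drop checksums that were seen only once.
    return [tuple(names) for names in names_by_checksum.values() if len(names) > 1]

pairs = [("sumstring1", "abc.txt"), ("sumstring2", "def.txt"),
         ("sumstring1", "ghi.txt"), ("sumstring2", "jkl.txt"),
         ("sumstring3", "mno.txt")]  # hypothetical non-dup, added for illustration
result = find_duplicate_files(pairs)
print(result)  # [('abc.txt', 'ghi.txt'), ('def.txt', 'jkl.txt')]
```

The trade-off is that a list does not de-duplicate: if the same (checksum, name) pair appears twice in the input, it is reported as a dup of itself, which the set version avoids.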
Upvotes: 1