Reputation: 111
I have a dictionary that maps filenames like 1.xml to lists of DeviceIDs like 3 and 12.
{'1.xml': ['3', '12'], '2.xml': ['23', '17'], '3.xml': ['1', '12']}
I also have code that compares the DeviceIDs and reports duplicates. Right now it only works when every file includes the DeviceID. When running this code:
d = {'1.xml': ['3', '12'], '2.xml': ['23', '17'], '3.xml': ['1', '12']}
it = iter(d.values())
intersection = set(next(it))
for vals in it:
    intersection &= set(vals)
print(intersection)
it returns
set()
because the DeviceID is only in the first and third files, but not in the second. Can someone help me modify this code so that it displays a DeviceID even when it is duplicated in only some of the files?
Upvotes: 2
Views: 1199
Reputation: 5835
The answer posted by Moses is fewer lines of code, but this addresses your question more directly and might perform better, depending on the dataset:
The reason your code doesn't work is that &-ing all the sets together keeps only the IDs present in every file; what you actually want is the union of the pairwise intersections. The following update to your code illustrates how to do this:
dev_ids = {'1.xml': ['3', '12'], '2.xml': ['23', '17'], '3.xml': ['1', '12']}
it = iter(dev_ids.values())
all_ids = set(next(it))
dups = set()
for vals in it:
    vals_set = set(vals)
    dups.update(all_ids.intersection(vals_set))
    all_ids.update(vals_set)
print(dups)
As you can see, we accumulate all the IDs into a set - .update() is essentially an in-place union operation - and intersect each new file's IDs against it as we go. Each intersection can be thought of as the "duplicates" contributed by that file. We accumulate the duplicates into the variable dups, and this becomes our answer.
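For reuse, the same union-of-intersections accumulation can be wrapped in a small helper (the function name find_duplicate_ids here is my own, not from the question):

```python
def find_duplicate_ids(file_ids):
    """Return DeviceIDs that appear in more than one file's list."""
    seen = set()   # every ID encountered so far
    dups = set()   # IDs seen in at least two files
    for ids in file_ids.values():
        ids = set(ids)
        dups |= seen & ids   # already-seen IDs in this file are duplicates
        seen |= ids
    return dups

dev_ids = {'1.xml': ['3', '12'], '2.xml': ['23', '17'], '3.xml': ['1', '12']}
print(find_duplicate_ids(dev_ids))  # {'12'}
```

Note that, like the loop above, this only detects IDs repeated across files, not an ID repeated within a single file's list.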
Upvotes: 1
Reputation: 78546
The set intersection drops all the previously found duplicates whenever a value in the dictionary does not contain them. So instead of a set, you can use a multiset - collections.Counter - to count how many times each DeviceID appears across the filename-deviceid dictionary:
from collections import Counter
d = {'1.xml': ['3', '12'], '2.xml': ['23', '17'], '3.xml': ['1', '12']}
c = Counter(i for val in d.values() for i in val)
print(c)
# Counter({'12': 2, '1': 1, '17': 1, '23': 1, '3': 1})
print(c.most_common(1))
# [('12', 2)]
If you have a large number of items and you're not sure which number to pass to most_common in order to get all the duplicated IDs, then you could use:
dupe_ids = [id for id, count in c.items() if count > 1]
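Alternatively, if you do want to stay with most_common, one option (a sketch using itertools.takewhile) is to iterate the full descending ranking and stop as soon as counts drop to 1:

```python
from collections import Counter
from itertools import takewhile

d = {'1.xml': ['3', '12'], '2.xml': ['23', '17'], '3.xml': ['1', '12']}
c = Counter(i for val in d.values() for i in val)

# most_common() with no argument yields (id, count) pairs in
# descending count order; take pairs while the count exceeds 1.
dupe_ids = [id for id, count in takewhile(lambda pair: pair[1] > 1, c.most_common())]
print(dupe_ids)  # ['12']
```

This avoids scanning the whole counter when only a few IDs are duplicated, at the cost of most_common sorting the counts first.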
Upvotes: 3