Jan

Reputation: 111

List of duplicate values in a dictionary python

I have a dictionary whose keys are filenames like 1.xml and whose values are lists of DeviceIDs like 3 and 12.

{'1.xml': ['3', '12'], '2.xml': ['23', '17'], '3.xml': ['1', '12']}

I have code that compares the DeviceIDs and displays the duplicates. Right now it only works when every file includes the DeviceID. When I run this code:

d = {'1.xml': ['3', '12'], '2.xml': ['23', '17'], '3.xml': ['1', '12']}

it = iter(d.values())
intersection = set(next(it))

for vals in it:
    intersection &= set(vals)

print(intersection)

it returns

set()

because the DeviceID is only in the first and third files, but not in the second. Can someone help me modify this code so that it displays a DeviceID even when it is duplicated in only some of the files?

Upvotes: 2

Views: 1199

Answers (2)

gbrener

Reputation: 5835

The answer posted by Moses uses fewer lines of code, but this addresses your question more directly and might perform better, depending on the dataset.

The reason your code doesn't work is that you are &-ing all the sets together, when what you actually want is the union of all pairwise intersections. The following update to your code illustrates how to do this:

dev_ids = {'1.xml': ['3', '12'], '2.xml': ['23', '17'], '3.xml': ['1', '12']}

it = iter(dev_ids.values())
all_ids = set(next(it))  # every ID seen so far
dups = set()             # IDs seen in more than one file

for vals in it:
    vals_set = set(vals)
    dups.update(all_ids.intersection(vals_set))  # IDs already seen elsewhere
    all_ids.update(vals_set)

print(dups)

As you can see, we accumulate all the IDs into a set - .update() is essentially an in-place union - and intersect each new file's IDs against it as we go. Each intersection can be thought of as the duplicates contributed by that file. We accumulate those duplicates into the variable dups, and that becomes our answer.
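The same accumulate-and-intersect logic can also be wrapped in a small helper, which makes it easy to reuse and to handle an empty dictionary. This is just a sketch; the name find_duplicate_ids is my own:

```python
def find_duplicate_ids(files):
    """Return the set of DeviceIDs that appear in more than one file."""
    it = iter(files.values())
    all_ids = set(next(it, []))  # every ID seen so far
    dups = set()                 # IDs seen in more than one file
    for vals in it:
        vals_set = set(vals)
        dups |= all_ids & vals_set  # IDs already seen elsewhere
        all_ids |= vals_set
    return dups

dev_ids = {'1.xml': ['3', '12'], '2.xml': ['23', '17'], '3.xml': ['1', '12']}
print(find_duplicate_ids(dev_ids))  # {'12'}
```

Passing [] as a default to next() means the function simply returns an empty set when the dictionary is empty, instead of raising StopIteration.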

Upvotes: 1

Moses Koledoye

Reputation: 78546

The set intersection drops all previously found duplicates whenever a new value in the dictionary does not contain them. So instead of a set, you can use a multiset - collections.Counter - to count how many times each DeviceID appears across the filename-to-DeviceID dictionary:

from collections import Counter

d = {'1.xml': ['3', '12'], '2.xml': ['23', '17'], '3.xml': ['1', '12']}

c = Counter(i for val in d.values() for i in val)
print(c)
# Counter({'12': 2, '1': 1, '17': 1, '23': 1, '3': 1})
print(c.most_common(1))
# [('12', 2)]

If you have a large number of items and you're not sure which number to pass to most_common in order to get all the duplicated IDs, you can use:

dupe_ids = [id for id, count in c.items() if count > 1]
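One caveat with the Counter approach: it counts every occurrence, so an ID listed twice inside a single file would also be flagged. If you only care about IDs shared across different files, a small variation counts each file's IDs at most once by converting the list to a set first (a sketch; note the deliberately repeated '12' in 1.xml):

```python
from collections import Counter

d = {'1.xml': ['3', '12', '12'], '2.xml': ['23', '17'], '3.xml': ['1', '12']}

# set(val) dedupes within each file, so an ID is counted at most
# once per file and only cross-file repeats reach a count above 1
c = Counter(i for val in d.values() for i in set(val))

dupe_ids = [id for id, count in c.items() if count > 1]
print(dupe_ids)  # ['12']
```

Without the set() call, the repeated '12' in 1.xml alone would already push its count above 1.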

Upvotes: 3
