Geuis

Reputation: 42267

How to remove unique, then duplicate dictionaries in a list?

Given the following list that contains some duplicate and some unique dictionaries, what is the best method to remove the unique dictionaries first, then reduce the duplicate dictionaries to single instances? I gotta say I only recently started getting into Python, but it's making this project so much easier. I'm just a bit stumped on this kind of problem.

So my list looks like this:

[{  'file': u'/file.txt',
    'line': u'line 666',
    'rule': u'A DUPLICATE RULE'},

{   'file': u'/file.txt',
    'line': u'line 666',
    'rule': u'A DUPLICATE RULE'},

{   'file': u'/uniquefile.txt',
    'line': u'line 999',
    'rule': u'A UNIQUE RULE'}]

What I'm going for is in the end, the list should look like:

[{  'file': u'/file.txt',
    'line': u'line 666',
    'rule': u'A DUPLICATE RULE'}]

Upvotes: 3

Views: 681

Answers (7)

gnud

Reputation: 78518

I always prefer to work with objects instead of dicts, if the fields are the same for every item.

So, I define a class:

class rule(object):
    def __init__(self, file, line, rule):
        self.file = file
        self.line = line
        self.rule = rule

    #Not a "magic" method, just a helper for all the methods below :)
    def _tuple_(self):
        return (self.file, self.line, self.rule)

    def __eq__(self, other):
        return cmp(self, other) == 0

    def __cmp__(self, other):
        return cmp(self._tuple_(), other._tuple_())

    def __hash__(self):
        return hash(self._tuple_())

    def __repr__(self):
        return repr(self._tuple_())

Now, create a list of these objects, and sort it. ruledict_list can be the example data in your question.

rules = [rule(**r) for r in ruledict_list]
rules.sort()

Loop through the (sorted) list, removing unique objects as we go. Finally, create a set to remove the duplicates. The loop also removes one copy of each duplicate object, but that doesn't matter, since the remaining copies collapse into the set anyway.

pos = 0
while pos < len(rules):
    while pos < len(rules)-1 and rules[pos] == rules[pos+1]:
        print "Skipping rule %s" % rules[pos]
        pos+=1
    rules.pop(pos)
rule_set = set(rules)

Upvotes: 2

Steven Huwig

Reputation: 20784

>>> import itertools
>>> list(a[0] for a in itertools.groupby(sorted(data)) if len(list(a[1])) > 1)
[{'file': u'/file.txt', 'line': u'line 666', 'rule': u'A DUPLICATE RULE'}]

There's probably a more efficient way to check this than len(list(a[1])).

Edit: I added a call to sorted.

Upvotes: 1

steveha

Reputation: 76695

This answer is based on Steven Huwig's answer. It's similar to his, but I use sorted() on the list so that groupby() works correctly.

Also, since he said "There's probably a more efficient way to check this than len(list(a[1])).", I decided to use some other way to check for non-unique items. Instead of expanding the whole group into a list, I try to call the .next() method on the iterator twice. If both calls succeed, there are at least two items in the iterator, and we are done with it; if we get a StopIteration exception on the first or second call to .next(), there were zero or one items in the iterator. (Actually, since we got this iterator from itertools.groupby, we know it will have at least one item in it.)

Also, instead of using explicit tuple indexing like a[0] and a[1], I used tuple unpacking, since that's what the cool kids seem to be doing these days.

Finally, instead of using a generator expression to compute the list, and using list() to force it to expand out into a list, I simply used a list comprehension.

data = [
    {
        'file': u'/file.txt',
        'line': u'line 666',
        'rule': u'A DUPLICATE RULE'
    },

    {   'file': u'/uniquefile.txt',
        'line': u'line 999',
        'rule': u'A UNIQUE RULE'
    },

    {   'file': u'/file.txt',
        'line': u'line 666',
        'rule': u'A DUPLICATE RULE'
    },

]

from itertools import groupby

def notunique(itr):
    try:
        itr.next()
        itr.next()
        return True
    except StopIteration:
        return False

def unique_list(lst):
    return [key for key, itr in groupby(sorted(lst)) if notunique(itr)]

print(unique_list(data))

Upvotes: 1

Joe Holloway

Reputation: 28948

Another option is to create your own data structure instead of using a dict. If you do this, then you can override __cmp__, __eq__ and __hash__. This will give you the ability to then use the 'set' data type in all its glory.

Here's one possible implementation, though I make no promises about the quality of the hash routine I've provided:

class Thing(object):
    def __init__(self, file, line, rule):
        self.file = file
        self.line = line
        self.rule = rule

    def __cmp__(self, other):
        result = cmp(self.file, other.file)
        if result == 0:
            result = cmp(self.line, other.line)
        if result == 0:
            result = cmp(self.rule, other.rule)
        return result

    def __eq__(self, other):
        return cmp(self, other) == 0

    def __hash__(self):
        return hash(self.file) * hash(self.line) * hash(self.rule)

    def __str__(self):
        return ', '.join([self.file, self.line, self.rule])

things = [ Thing(u'/file.txt', u'line 666', u'A DUPLICATE RULE'),
  Thing(u'/file.txt', u'line 666', u'A DUPLICATE RULE'),
  Thing(u'/uniquefile.txt', u'line 999', u'A UNIQUE RULE')]

duplicate_things = set()
unique_things = set()
for t in things:
    if t in unique_things:
        duplicate_things.add(t)
    else:
        unique_things.add(t)

If you need to get back to a list, just construct one from the resulting set:

unique_things = list(unique_things)
duplicate_things = list(duplicate_things)

It's a bit more code to create your own class like this, but may give you other options down the road if your program grows in complexity.

Edit

OK, my hands are faster than my eyes tonight, but I think this edit solves the problem pointed out by @nosklo.

Upvotes: 0

nosklo

Reputation: 222842

One idea is to sort the data. Assume inputdata is your list from above:

from itertools import groupby
from operator import itemgetter

inputdata.sort(key=itemgetter(*inputdata[0])) # sort by each dict's values so equal dicts end up adjacent for groupby
print [k for k, g in groupby(inputdata) if len(list(g)) > 1]

prints:

[{'line': u'line 666', 'file': u'/file.txt', 'rule': u'A DUPLICATE RULE'}]

Upvotes: 4

nosklo

Reputation: 222842

Another way is to keep a count for each dict, keyed on a frozenset of its items:

from operator import itemgetter
from collections import defaultdict

counter = defaultdict(int)
for d in inputdata:
    counter[frozenset(d.iteritems())] += 1

result = [dict(item) for item, count in counter.iteritems() if count > 1]
print result

I think this is the best answer so far, because it is very simple to understand and runs in linear time.

Upvotes: 1

EMP

Reputation: 61971

I'd make another dictionary, using the existing dictionaries as keys and the count of occurrences as values. (Python doesn't allow dictionaries to be used as dictionary keys out of the box, but there are a couple of ways of doing that mentioned in this answer.) Then it's just a matter of iterating over it and selecting the keys where the value is greater than 1.

Of course, using dictionaries as keys relies on their contents not changing over time - at least over the time that you need to use the resulting dictionary. (This is why Python doesn't support it natively.)
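
For illustration, here is a minimal sketch of that idea (not EMP's exact code), using a tuple of each dict's sorted items as the hashable stand-in for the dict itself; inputdata is assumed to be the list from the question:

counts = {}
for d in inputdata:
    key = tuple(sorted(d.iteritems()))  # hashable stand-in for the mutable dict
    counts[key] = counts.get(key, 0) + 1

# keep only the entries seen more than once, converting each key back to a dict
print [dict(key) for key, count in counts.iteritems() if count > 1]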

Upvotes: 1
