Reputation: 42267
Given the following list that contains some duplicate and some unique dictionaries, what is the best method to remove the unique dictionaries first, then reduce the duplicate dictionaries to single instances? I have to say I only recently started getting into Python, but it's making this project so much easier. I'm just a bit stumped on this kind of problem.
So my list looks like this:
[{'file': u'/file.txt',
  'line': u'line 666',
  'rule': u'A DUPLICATE RULE'},
 {'file': u'/file.txt',
  'line': u'line 666',
  'rule': u'A DUPLICATE RULE'},
 {'file': u'/uniquefile.txt',
  'line': u'line 999',
  'rule': u'A UNIQUE RULE'}]
What I'm going for is in the end, the list should look like:
[{'file': u'/file.txt',
  'line': u'line 666',
  'rule': u'A DUPLICATE RULE'}]
Upvotes: 3
Views: 681
Reputation: 78518
I always prefer to work with objects instead of dicts, if the fields are the same for every item.
So, I define a class:
class rule(object):
    # Python 2 style: cmp() and __cmp__ were removed in Python 3.
    def __init__(self, file, line, rule):
        self.file = file
        self.line = line
        self.rule = rule

    # Not a "magic" method, just a helper for all the methods below :)
    def _tuple_(self):
        return (self.file, self.line, self.rule)

    def __eq__(self, other):
        return cmp(self, other) == 0

    def __cmp__(self, other):
        return cmp(self._tuple_(), other._tuple_())

    def __hash__(self):
        return hash(self._tuple_())

    def __repr__(self):
        return repr(self._tuple_())
Now, create a list of these objects, and sort it. ruledict_list can be the example data in your question.
rules = [rule(**r) for r in ruledict_list]
rules.sort()
Loop through the sorted list, removing unique objects as we go, then build a set from what remains to collapse the duplicates. The loop also removes one instance of each duplicated object, but that doesn't matter, since at least one instance of each survives to reach the set.
pos = 0
while pos < len(rules):
    while pos < len(rules) - 1 and rules[pos] == rules[pos + 1]:
        print "Skipping rule %s" % rules[pos]
        pos += 1
    rules.pop(pos)
rule_set = set(rules)
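Since the rule objects are hashable, the skip-and-pop loop can also be replaced by a direct count. This is my own variant rather than part of the answer, and it assumes Python 2.7+ for collections.Counter:

from collections import Counter

# Count each hashable rule object; keep only those that occur more than once.
counts = Counter(rules)
rule_set = set(r for r, n in counts.items() if n > 1)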
Upvotes: 2
Reputation: 20784
>>> import itertools
>>> list(a[0] for a in itertools.groupby(sorted(data)) if len(list(a[1])) > 1)
[{'file': u'/file.txt', 'line': u'line 666', 'rule': u'A DUPLICATE RULE'}]
There's probably a more optimal way to check this than len(list(a[1])).
Edit: I added a call to sorted.
Upvotes: 1
Reputation: 76695
This answer is based on Steven Huwig's answer. It's similar to his, but I use sorted() on the list so that groupby() works correctly.

Also, since he said "There's probably a more optimal way to check this than len(list(a[1])).", I decided to use some other way to check for non-unique items. Instead of forcing the whole list, I try to call the .next() method on the iterator, twice. If it works twice, there are at least two items in the iterator, and we are done with it; if we get a StopIteration exception on the first or second call to .next(), there were zero or one items in the iterator. (Actually, since we got this iterator from itertools.groupby, we know it will have at least one item in it.)

Also, instead of using explicit tuple indexing like a[0] and a[1], I used tuple unpacking, since that's what the cool kids seem to be doing these days.

Finally, instead of using a generator expression to compute the list, and using list() to force it to expand out into a list, I simply used a list comprehension.
data = [
    {
        'file': u'/file.txt',
        'line': u'line 666',
        'rule': u'A DUPLICATE RULE'
    },
    {
        'file': u'/uniquefile.txt',
        'line': u'line 999',
        'rule': u'A UNIQUE RULE'
    },
    {
        'file': u'/file.txt',
        'line': u'line 666',
        'rule': u'A DUPLICATE RULE'
    },
]

from itertools import groupby

def notunique(itr):
    try:
        itr.next()
        itr.next()
        return True
    except StopIteration:
        return False

def unique_list(lst):
    return [key for key, itr in groupby(sorted(lst)) if notunique(itr)]

print(unique_list(data))
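The same non-unique check can be written without the explicit try/except by using itertools.islice to pull at most two items from the group. This is my own sketch, not from the answer (note that itr.next() above is the Python 2 spelling; Python 3 uses next(itr)):

from itertools import islice

def notunique(itr):
    # The group is non-unique if its iterator yields at least two items.
    return len(list(islice(itr, 2))) == 2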
Upvotes: 1
Reputation: 28948
Another option is to create your own data structure instead of using a dict. If you do this, then you can override __cmp__, __eq__ and __hash__, which lets you use the 'set' data type in all its glory.
Here's one possible implementation, though I make no promises about the quality of the hash routine I've provided:
class Thing(object):
    def __init__(self, file, line, rule):
        self.file = file
        self.line = line
        self.rule = rule

    def __cmp__(self, other):
        result = cmp(self.file, other.file)
        if result == 0:
            result = cmp(self.line, other.line)
        if result == 0:
            result = cmp(self.rule, other.rule)
        return result

    def __eq__(self, other):
        return cmp(self, other) == 0

    def __hash__(self):
        return hash(self.file) * hash(self.line) * hash(self.rule)

    def __str__(self):
        return ', '.join([self.file, self.line, self.rule])
things = [Thing(u'/file.txt', u'line 666', u'A DUPLICATE RULE'),
          Thing(u'/file.txt', u'line 666', u'A DUPLICATE RULE'),
          Thing(u'/uniquefile.txt', u'line 999', u'A UNIQUE RULE')]

duplicate_things = set()
unique_things = set()
for t in things:
    if t in unique_things:
        duplicate_things.add(t)
    else:
        unique_things.add(t)
If you need to get back to a list, just construct one from the resulting set:
unique_things = list(unique_things)
duplicate_things = list(duplicate_things)
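For the sample data above, a quick check of the result (my own illustration, relying on the __str__ defined earlier):

for t in duplicate_things:
    print t    # -> /file.txt, line 666, A DUPLICATE RULE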
It's a bit more code to create your own class like this, but may give you other options down the road if your program grows in complexity.
Edit
OK, my hands are faster than my eyes tonight, but I think this edit solves the problem pointed out by @nosklo
Upvotes: 0
Reputation: 222842
One idea is to sort the data. Assume inputdata is your list from above:
from itertools import groupby
from operator import itemgetter
inputdata.sort(key=itemgetter(*inputdata[0])) # ensures order
print [k for k, g in groupby(inputdata) if len(list(g)) > 1]
prints:
[{'line': u'line 666', 'file': u'/file.txt', 'rule': u'A DUPLICATE RULE'}]
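The sort key is worth a note: unpacking a dict yields its keys, so itemgetter(*inputdata[0]) is built once from the first dict's keys and then extracts the same fields, in the same order, from every dict. A small demonstration of my own, not part of the answer:

from operator import itemgetter

d = {'file': u'/file.txt', 'line': u'line 666', 'rule': u'A DUPLICATE RULE'}
key = itemgetter(*d)  # e.g. itemgetter('file', 'line', 'rule')
print key(d)          # a tuple of the dict's values, usable as a sort key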
Upvotes: 4
Reputation: 222842
Another way is to make a counter for each dict data, based on a frozenset of items:
from collections import defaultdict

counter = defaultdict(int)
for d in inputdata:
    counter[frozenset(d.iteritems())] += 1

result = [dict(item) for item, count in counter.iteritems() if count > 1]
print result
I think that is the best answer so far, because it is very simple to understand and runs in linear time.
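On Python 2.7+, the same idea condenses into collections.Counter; this is my own restatement rather than part of the answer:

from collections import Counter

counter = Counter(frozenset(d.iteritems()) for d in inputdata)
result = [dict(item) for item, count in counter.items() if count > 1]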
Upvotes: 1
Reputation: 61971
I'd make another dictionary, using the existing dictionaries as keys and the count of occurrences as values. (Python doesn't allow dictionaries to be used as dictionary keys out of the box, but there are a couple of ways of doing that mentioned in this answer.) Then it's just a matter of iterating over it and selecting the keys where the value is greater than 1.
Of course, using dictionaries as keys relies on their contents not changing over time - at least over the time that you need to use the resulting dictionary. (This is why Python doesn't support it natively.)
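A minimal sketch of that idea, using frozenset(d.items()) as a hashable stand-in for each dict (my own illustration; the answer doesn't prescribe a particular conversion):

counts = {}
for d in inputdata:
    key = frozenset(d.items())  # hashable stand-in for the (unhashable) dict
    counts[key] = counts.get(key, 0) + 1

result = [dict(k) for k, n in counts.items() if n > 1]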
Upvotes: 1