Reputation: 1073
I have a program which reads in Python objects one by one (this is fixed) and needs to remove duplicate objects. The program will output a list of unique objects.
Pseudo-code is similar to this:
1. Create an empty list to store unique objects and return it at the end
2. Read in a single object
3. If an identical object is not in the list, add it to the list
4. Repeat 2 and 3 until there are no more objects to read, then terminate and return the list (and the number of duplicate objects that were removed).
The actual code uses set operations to check for duplicates (wrapped in a function here so the return is valid):
    #!/usr/bin/python
    import pickle

    import MyObject

    def removeDuplicates(inputFile):
        numDupRemoved = 0
        uniqueObjects = set()
        with open(inputFile, 'rb') as fileIn:
            while True:
                try:
                    thisObject = pickle.load(fileIn)
                    if thisObject in uniqueObjects:
                        numDupRemoved += 1
                        continue
                    else:
                        uniqueObjects.add(thisObject)
                except EOFError:
                    break
        print("Number of duplicate objects removed: %d" % numDupRemoved)
        return list(uniqueObjects)
The (simplified) object looks like this (note that all values are integers, so we don't need to worry about floating-point precision errors):
    #!/usr/bin/python
    class MyObject:
        def __init__(self, attr1, attr2, attr3):
            self.attribute1 = attr1  # List of ints
            self.attribute2 = attr2  # List of lists (each list is a list of ints)
            self.attribute3 = attr3  # List of ints

        def __eq__(self, other):
            if isinstance(other, self.__class__):
                return (self.attribute1, self.attribute2, self.attribute3) == \
                       (other.attribute1, other.attribute2, other.attribute3)
            return NotImplemented

        def __hash__(self):
            return self.generateHash()

        def generateHash(self):
            # Convert lists to tuples so they are hashable
            attribute1_tuple = tuple(self.attribute1)
            # Since attribute2 is a list of lists, convert it to a tuple of tuples
            attribute2_tuple = tuple(tuple(sublist) for sublist in self.attribute2)
            attribute3_tuple = tuple(self.attribute3)
            return hash((attribute1_tuple, attribute2_tuple, attribute3_tuple))
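As a quick sanity check of the equality/hash contract, two instances with identical attribute values should compare equal, hash identically, and collapse to a single set entry. A minimal self-contained sketch (the class here is a condensed stand-in mirroring the definition above, not the real module):

```python
class MyObject:
    def __init__(self, attr1, attr2, attr3):
        self.attribute1 = attr1  # list of ints
        self.attribute2 = attr2  # list of lists of ints
        self.attribute3 = attr3  # list of ints

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return (self.attribute1, self.attribute2, self.attribute3) == \
                   (other.attribute1, other.attribute2, other.attribute3)
        return NotImplemented

    def __hash__(self):
        # Lists are unhashable, so hash a fully tuple-ized snapshot
        return hash((tuple(self.attribute1),
                     tuple(tuple(s) for s in self.attribute2),
                     tuple(self.attribute3)))

a = MyObject([1], [[2, 3]], [4])
b = MyObject([1], [[2, 3]], [4])   # same values, different instance
c = MyObject([1], [[2, 3]], [5])   # differs only in attribute3
print(a == b)          # True: value equality
print(len({a, b, c}))  # 2: a and b collapse to one entry
```

Because `__eq__` and `__hash__` agree, `set` membership behaves by value, which is exactly what the deduplication loop relies on.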
However, I now need to keep track of duplicates by individual attributes or subsets of attributes of MyObject. That is, if the previous code was only removing duplicates in the darker blue region of the diagram below (where two objects are considered duplicates if all 3 attributes are identical), we would now like to:
1. Remove duplicates by a subset of attributes (attributes 1 and 2) AND/OR an individual attribute (attribute 3)
2. Be able to track the 3 disjoint regions of the diagram
I have created two more classes to do this:
    #!/usr/bin/python
    class MyObject_sub1:
        def __init__(self, attr1, attr2):
            self.attribute1 = attr1  # List of ints
            self.attribute2 = attr2  # List of lists (each list is a list of ints)

        def __eq__(self, other):
            if isinstance(other, self.__class__):
                return (self.attribute1, self.attribute2) == \
                       (other.attribute1, other.attribute2)
            return NotImplemented

        def __hash__(self):
            return self.generateHash()

        def generateHash(self):
            # Convert lists to tuples so they are hashable
            attribute1_tuple = tuple(self.attribute1)
            # Since attribute2 is a list of lists, convert it to a tuple of tuples
            attribute2_tuple = tuple(tuple(sublist) for sublist in self.attribute2)
            return hash((attribute1_tuple, attribute2_tuple))
and
    #!/usr/bin/python
    class MyObject_sub2:
        def __init__(self, attr3):
            self.attribute3 = attr3  # List of ints

        def __eq__(self, other):
            if isinstance(other, self.__class__):
                return self.attribute3 == other.attribute3
            return NotImplemented

        def __hash__(self):
            return hash(tuple(self.attribute3))
Duplicate removal code is updated as below:
    #!/usr/bin/python
    import pickle

    import MyObject
    import MyObject_sub1
    import MyObject_sub2

    def removeDuplicates(inputFile):
        # counters
        totalNumDupRemoved = 0
        numDupRemoved_att1Att2Only = 0
        numDupRemoved_allAtts = 0
        numDupRemoved_att3Only = 0
        # sets for duplicate-removal purposes
        uniqueObjects_att1Att2Only = set()
        uniqueObjects_allAtts = set()  # Intersection part of the diagram
        uniqueObjects_att3Only = set()
        with open(inputFile, 'rb') as fileIn:
            while True:
                try:
                    thisObject = pickle.load(fileIn)
                    # I will omit how thisObject_sub1 (MyObject_sub1) and
                    # thisObject_sub2 (MyObject_sub2) are created for brevity
                    if thisObject_sub1 in uniqueObjects_att1Att2Only or thisObject_sub2 in uniqueObjects_att3Only:
                        totalNumDupRemoved += 1
                        if thisObject in uniqueObjects_allAtts:
                            numDupRemoved_allAtts += 1
                        elif thisObject_sub1 in uniqueObjects_att1Att2Only:
                            numDupRemoved_att1Att2Only += 1
                        else:
                            numDupRemoved_att3Only += 1
                        continue
                    else:
                        uniqueObjects_att1Att2Only.add(thisObject_sub1)
                        uniqueObjects_allAtts.add(thisObject)  # Intersection part of the diagram
                        uniqueObjects_att3Only.add(thisObject_sub2)
                except EOFError:
                    break
        print("Total number of duplicates removed: %d" % totalNumDupRemoved)
        print("Number of duplicates where all attributes are identical: %d" % numDupRemoved_allAtts)
        print("Number of duplicates where attributes 1 and 2 are identical: %d" % numDupRemoved_att1Att2Only)
        print("Number of duplicates where only attribute 3 is identical: %d" % numDupRemoved_att3Only)
        return list(uniqueObjects_allAtts)
What's been driving me insane is that "numDupRemoved_allAtts" from the second program does not match "numDupRemoved" from the first program.
For example, both programs read the same file containing about 80,000 objects, and the outputs were vastly different.
First program:
Number of duplicate objects removed: 47,742 (which should be the intersecting part of the diagram)
Second program:
Total number of duplicates removed: 66,648
Number of duplicates where all attributes are identical: 18,137 (intersection of the diagram)
Number of duplicates where attributes 1 and 2 are identical: 46,121 (left disjoint set of the diagram)
Number of duplicates where only attribute 3 is identical: 2,390 (right disjoint set of the diagram)
Note that before trying multiple Python classes (MyObject_sub1 and MyObject_sub2) and set operations, I also tried tuple equality (checking equality of tuples of individual attributes or subsets of attributes) for duplicate checking, but the numbers still didn't match up.
Am I missing some fundamental Python concepts here? What would be causing this error? Any help would be greatly appreciated.
Upvotes: 1
Views: 122
Reputation: 10959
Example: if the first processed object has attributes (1, 2, 3) and the next has (1, 2, 4), then in the first variant both are added as unique (and recognized later).
In the second variant, the first object is recorded in uniqueObjects_att1Att2Only (and the other sets). When the second object arrives, the check

    if thisObject_sub1 in uniqueObjects_att1Att2Only or thisObject_sub2 in uniqueObjects_att3Only:

is true, so the else part that records to uniqueObjects_allAtts isn't executed. This means that (1, 2, 4) will never be added to uniqueObjects_allAtts and will never increment numDupRemoved_allAtts, no matter how often it appears.
Solution: let the duplicate detection for each set happen independently, one after another. To record totalNumDupRemoved, create a flag which is set to True when any of the duplicate detections triggers, and increment totalNumDupRemoved if the flag is true.
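A minimal sketch of that fixed loop, keeping the question's counter and set names. Plain tuples stand in here for the pickled MyObject / MyObject_sub1 / MyObject_sub2 instances (a valid substitution, since all of them hash and compare by value), and the function takes an iterable of pre-built key triples rather than reading a pickle file:

```python
def removeDuplicates(keyedObjects):
    # keyedObjects yields (thisObject, thisObject_sub1, thisObject_sub2)
    # triples; plain tuples stand in for the MyObject* instances.
    totalNumDupRemoved = 0
    numDupRemoved_allAtts = 0
    numDupRemoved_att1Att2Only = 0
    numDupRemoved_att3Only = 0
    uniqueObjects_allAtts = set()
    uniqueObjects_att1Att2Only = set()
    uniqueObjects_att3Only = set()

    for thisObject, sub1, sub2 in keyedObjects:
        # Duplicate detection for each set happens independently...
        fullDup = thisObject in uniqueObjects_allAtts
        sub1Dup = sub1 in uniqueObjects_att1Att2Only
        sub2Dup = sub2 in uniqueObjects_att3Only
        # ...and every set is updated regardless of the other tests,
        # so later objects are always checked against complete sets
        # (set.add is a no-op for keys already present).
        uniqueObjects_allAtts.add(thisObject)
        uniqueObjects_att1Att2Only.add(sub1)
        uniqueObjects_att3Only.add(sub2)

        if fullDup or sub1Dup or sub2Dup:  # the "flag" for the total
            totalNumDupRemoved += 1
            if fullDup:                    # intersection of the diagram
                numDupRemoved_allAtts += 1
            elif sub1Dup:                  # left disjoint region
                numDupRemoved_att1Att2Only += 1
            else:                          # right disjoint region
                numDupRemoved_att3Only += 1

    return (list(uniqueObjects_allAtts), totalNumDupRemoved,
            numDupRemoved_allAtts, numDupRemoved_att1Att2Only,
            numDupRemoved_att3Only)
```

With the (1, 2, 3) / (1, 2, 4) example above, (1, 2, 4) now does get recorded in uniqueObjects_allAtts on first sight, so a later repeat of it is correctly counted in numDupRemoved_allAtts.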
Upvotes: 1