user32147

Reputation: 1073

Python duplicate object removal by subset of its attributes or individual attribute

I have a program that reads in Python objects one at a time (this part is fixed) and needs to remove duplicate objects. The program outputs a list of unique objects.

Pseudo-code is similar to this:

1. Create an empty list to store unique objects, to be returned at the end.
2. Read in a single object.
3. If an identical object is not already in the list, add it to the list.
4. Repeat steps 2 and 3 until there are no more objects to read, then terminate and return the list (and the number of duplicate objects that were removed).

The actual code uses a set to check for duplicates:

#!/usr/bin/python
import MyObject  # module that defines the MyObject class
import pickle


def removeDuplicates(inputFile):
    numDupRemoved = 0
    uniqueObjects = set()

    with open(inputFile, 'rb') as fileIn:
        while True:
            try:
                thisObject = pickle.load(fileIn)
                if thisObject in uniqueObjects:
                    numDupRemoved += 1
                    continue
                else:
                    uniqueObjects.add(thisObject)
            except EOFError:
                # pickle.load raises EOFError once the file is exhausted
                break
        print("Number of duplicate objects removed: %d" % numDupRemoved)
    return list(uniqueObjects)

The (simplified) object looks like this (note that all values are integers, so we don't need to worry about floating-point precision errors):

#!/usr/bin/python
class MyObject:
    def __init__(self, attr1, attr2, attr3):
        self.attribute1 = attr1  # List of ints
        self.attribute2 = attr2  # List of lists (each list is a list of ints)
        self.attribute3 = attr3  # List of ints

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return (self.attribute1, self.attribute2, self.attribute3) == (other.attribute1, other.attribute2, other.attribute3)
        return NotImplemented  # let Python handle comparison with other types

    def __hash__(self):
        return self.generateHash()

    def generateHash(self):
        # Convert lists to tuples 
        attribute1_tuple = tuple(self.attribute1)

        # Since attribute2 is list of list, convert to tuple of tuple
        attribute2_tuple = []
        for sublist in self.attribute2:
            attribute2_tuple.append(tuple(sublist))
        attribute2_tuple = tuple(attribute2_tuple)

        attribute3_tuple = tuple(self.attribute3)

        return hash((attribute1_tuple, attribute2_tuple, attribute3_tuple))

However, I now need to keep track of duplicates by an individual attribute or by a subset of the attributes of MyObject. The previous code was only removing duplicates in the darker blue region of the diagram below (where two objects are considered duplicates if all 3 attributes are identical). We would now like to:

1. Remove duplicates by a subset of attributes (attributes 1 and 2) AND/OR by an individual attribute (attribute 3)
2. Be able to track the 3 disjoint regions of the diagram

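[Image: Venn diagram of two overlapping sets. Left: attributes 1 and 2 identical. Right: attribute 3 identical. Intersection (darker blue): all attributes identical.]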

I have created two more classes to do this:

#!/usr/bin/python
class MyObject_sub1:
    def __init__(self, attr1, attr2):
        self.attribute1 = attr1  # List of ints
        self.attribute2 = attr2  # List of lists (each list is a list of ints)

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return (self.attribute1, self.attribute2) == (other.attribute1, other.attribute2)
        return NotImplemented  # let Python handle comparison with other types

    def __hash__(self):
        return self.generateHash()

    def generateHash(self):
        # Convert lists to tuples 
        attribute1_tuple = tuple(self.attribute1)

        # Since attribute2 is list of list, convert to tuple of tuple
        attribute2_tuple = []
        for sublist in self.attribute2:
            attribute2_tuple.append(tuple(sublist))
        attribute2_tuple = tuple(attribute2_tuple)

        return hash((attribute1_tuple, attribute2_tuple))

and

#!/usr/bin/python
class MyObject_sub2:
    def __init__(self, attr3):
        self.attribute3 = attr3  # List of ints

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return self.attribute3 == other.attribute3
        return NotImplemented  # let Python handle comparison with other types

    def __hash__(self):
        return hash(tuple(self.attribute3))

The duplicate removal code was updated as follows:

#!/usr/bin/python
import MyObject
import MyObject_sub1
import MyObject_sub2
import pickle


def removeDuplicates(inputFile):
    # counters
    totalNumDupRemoved = 0
    numDupRemoved_att1Att2Only = 0
    numDupRemoved_allAtts = 0
    numDupRemoved_att3Only = 0

    # sets for duplicate removal purposes
    uniqueObjects_att1Att2Only = set()
    uniqueObjects_allAtts = set()  # Intersection part in the diagram
    uniqueObjects_att3Only = set()

    with open(inputFile, 'rb') as fileIn:
        while True:
            try:
                thisObject = pickle.load(fileIn)
                # I will omit how thisObject_sub1 (MyObject_sub1) and thisObject_sub2 (MyObject_sub2) are created for brevity

                if thisObject_sub1 in uniqueObjects_att1Att2Only or thisObject_sub2 in uniqueObjects_att3Only:
                    totalNumDupRemoved += 1
                    if thisObject in uniqueObjects_allAtts:
                        numDupRemoved_allAtts += 1
                    elif thisObject_sub1 in uniqueObjects_att1Att2Only:
                        numDupRemoved_att1Att2Only += 1
                    else:
                        numDupRemoved_att3Only += 1
                    continue
                else:
                    uniqueObjects_att1Att2Only.add(thisObject_sub1)
                    uniqueObjects_allAtts.add(thisObject)  # Intersection part in the diagram
                    uniqueObjects_att3Only.add(thisObject_sub2)
            except EOFError:
                break
        print("Total number of duplicates removed: %d" % totalNumDupRemoved)
        print("Number of duplicates where all attributes are identical: %d" % numDupRemoved_allAtts)
        print("Number of duplicates where attributes 1 and 2 are identical: %d" % numDupRemoved_att1Att2Only)
        print("Number of duplicates where only attribute 3 are identical: %d" % numDupRemoved_att3Only)
    return list(uniqueObjects_allAtts)

What's been driving me insane is that "numDupRemoved_allAtts" from the second program does not match "numDupRemoved" from the first program.

For example, both programs read the same file containing about 80,000 objects in total, and the outputs were vastly different:

First program output

Number of duplicate objects removed: 47,742 (which should be the intersecting part of the diagram)

Second program output

Total number of duplicates removed: 66,648

Number of duplicates where all attributes are identical: 18,137 (intersection of diagram)

Number of duplicates where attributes 1 and 2 are identical: 46,121 (left disjoint set of diagram)

Number of duplicates where only attribute 3 are identical: 2,390 (right disjoint set of diagram)

Note that before I tried using multiple Python classes (MyObject_sub1 and MyObject_sub2) and set operations, I had also tried using tuple equality for duplicate checking (comparing tuples of an individual attribute or of a subset of attributes), but the numbers still didn't match up.

Am I missing some fundamental Python concept here? What could be causing this error? Any help would be greatly appreciated.

Upvotes: 1

Views: 122

Answers (1)

Michael Butscher

Reputation: 10959

Example: If the first processed object has attributes (1, 2, 3) and the next has (1, 2, 4), then in the first variant both are added as unique (and recognized as duplicates if they appear again later).

In the second variant, the first object would be recorded in uniqueObjects_att1Att2Only (and the other sets). When the second object arrives, the condition

if thisObject_sub1 in uniqueObjects_att1Att2Only or thisObject_sub2 in uniqueObjects_att3Only:

is true, so the else branch that records the object in uniqueObjects_allAtts isn't executed. This means that (1, 2, 4) will never be added to uniqueObjects_allAtts and will never increment numDupRemoved_allAtts, regardless of how often it appears.
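The example can be reproduced with a small sketch. To keep it short, it uses plain tuples of ints as stand-ins for the MyObject instances and their list attributes; the function names and the sample stream are made up for illustration and are not from the question.

```python
def count_all_attr_dups_v1(objects):
    """First variant: a single set keyed on all three attributes."""
    seen, dups = set(), 0
    for obj in objects:
        if obj in seen:
            dups += 1
        else:
            seen.add(obj)
    return dups


def count_all_attr_dups_v2(objects):
    """Second variant: the full key is only recorded when neither
    partial key has been seen before (same control flow as the question)."""
    seen_12, seen_3, seen_all = set(), set(), set()
    dups_all = 0
    for a1, a2, a3 in objects:
        if (a1, a2) in seen_12 or a3 in seen_3:
            if (a1, a2, a3) in seen_all:
                dups_all += 1
            continue  # the full key is never recorded on this path
        seen_12.add((a1, a2))
        seen_3.add(a3)
        seen_all.add((a1, a2, a3))
    return dups_all


# Hypothetical stream: (1, 2, 4) appears three times
stream = [(1, 2, 3), (1, 2, 4), (1, 2, 4), (1, 2, 4)]
print(count_all_attr_dups_v1(stream))  # 2
print(count_all_attr_dups_v2(stream))  # 0
```

The two repeats of (1, 2, 4) are counted by the first variant but are attributed to the "attributes 1 and 2" counter in the second, which would be consistent with the much smaller all-attributes count reported in the question.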

Solution: Let the duplicate detection for each set happen independently, one after another.

To record totalNumDupRemoved, create a flag that is set to True when any of the duplicate detections triggers, and increment totalNumDupRemoved if the flag is true.
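A minimal sketch of this idea, again using plain tuples of ints in place of MyObject and its list attributes (the function name, the dictionary keys and the sample stream are assumptions for illustration only):

```python
def count_duplicates_independently(objects):
    """Run each duplicate detection on its own set, one after another."""
    seen_all, seen_12, seen_3 = set(), set(), set()
    counts = {"total": 0, "all": 0, "att12": 0, "att3": 0}
    for a1, a2, a3 in objects:
        is_dup = False  # flag: did any detection trigger for this object?
        checks = (
            ("all", seen_all, (a1, a2, a3)),
            ("att12", seen_12, (a1, a2)),
            ("att3", seen_3, a3),
        )
        for name, seen, key in checks:
            if key in seen:
                counts[name] += 1
                is_dup = True
            else:
                seen.add(key)  # always record the key, whatever the other checks say
        if is_dup:
            counts["total"] += 1
    return counts


# Hypothetical stream for illustration
stream = [(1, 2, 3), (1, 2, 4), (1, 2, 4), (5, 6, 3)]
print(count_duplicates_independently(stream))
# {'total': 3, 'all': 1, 'att12': 2, 'att3': 2}
```

Note that with this approach the per-set counters overlap: an object that repeats in all attributes is also counted as a repeat in attributes 1 and 2 and in attribute 3. If counts for the disjoint regions of the diagram are needed, they would have to be derived from which combination of checks triggered for each object.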

Upvotes: 1
