Connor

Reputation: 1426

Trying to remove duplicates from large list of objects, keep certain one

I have a large list of objects in Python that I'm storing in a text file (for lack of knowledge of how to use any other database for the present).

Currently there are 40,000, but I expect the list may eventually exceed 1,000,000 objects. I'm trying to remove duplicates, where duplicates are defined as different objects having the same value for a text-string attribute, while keeping the most recent version of each such object (defined as the one with the highest value in another attribute).

What I want to make is a function that returns only objects 2 and 3 from the following list, reliably:

Object 1: text="hello"            ID=1
Object 2: text="hello"            ID=2
Object 3: text="something else"   ID=3

Doing this manually (looping through the whole list for each object) is already too slow and will only get slower at O(n^2), so I need a smarter way to do it. I have seen hashing the objects and using the set function recommended multiple times, but I have two questions about this that I haven't found satisfactory answers to:

  1. How does hashing improve the efficiency to the degree it does?

  2. How can I do this and retain only the most recent such object? The examples I have seen all use the set function and I'm not sure how that would return only the most recent one.

EDIT: I can probably find good answers to question 1 elsewhere, but I am still stuck on question 2. To take another stab at explaining it: hashing the objects above on their text and using the set function returns a set where the object kept from each group of duplicates is chosen arbitrarily (e.g., above, either (Object 2, Object 3) or (Object 1, Object 3) could be returned; I need (Object 2, Object 3)).
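
For reference, the brute-force version I'm trying to replace looks roughly like this (attribute names simplified to text and ID, as in the example above):

def remove_duplicates(objects):
    # brute force: for every object, rescan the whole list looking for a more
    # recent duplicate (same text, higher ID) -- O(n^2) comparisons
    result = []
    for obj in objects:
        newer_exists = False
        for other in objects:
            if other.text == obj.text and other.ID > obj.ID:
                newer_exists = True
                break
        if not newer_exists:
            result.append(obj)
    return result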

Upvotes: 0

Views: 144

Answers (3)

Joran Beasley

Reputation: 113930

change to using a database ...

import sqlite3

db = sqlite3.connect("my.db")
db.execute("CREATE TABLE IF NOT EXISTS my_items (text PRIMARY KEY, id INTEGER);")
my_list_of_items = [("test", 1), ("test", 2), ("asdasd", 3)]
# INSERT OR REPLACE overwrites the existing row for each text (the PRIMARY KEY)
db.executemany("INSERT OR REPLACE INTO my_items (text, id) VALUES (?, ?)", my_list_of_items)
db.commit()

print(db.execute("SELECT * FROM my_items").fetchall())

this may have marginally higher overhead in terms of time ... but you will save on RAM
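
One caveat: INSERT OR REPLACE keeps whichever row was inserted last for each text, so if the list is not already sorted by ID, the highest ID can instead be selected at query time with a grouped query. A rough sketch (same table layout, rows deliberately out of ID order):

import sqlite3

db = sqlite3.connect(":memory:")    # in-memory database, just for the sketch
db.execute("CREATE TABLE IF NOT EXISTS my_items (text, id INTEGER);")
rows = [("test", 2), ("test", 1), ("asdasd", 3)]
db.executemany("INSERT INTO my_items (text, id) VALUES (?, ?)", rows)
db.commit()

# keep the highest id per text regardless of insertion order
print(db.execute("SELECT text, MAX(id) FROM my_items GROUP BY text").fetchall())
# e.g. [('asdasd', 3), ('test', 2)]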

Upvotes: 1

aghast

Reputation: 15290

Hashing is a well-researched subject in Computer Science. One of the standard uses is for implementing what Python calls a dictionary. (Perl calls the same thing a hash, for some reason. ;-) )

The idea is that for some key, such as a string, you can compute a simple numeric function - the hash value - and use that number as a quick way to look up the associated value stored in the dictionary.

Python has the built-in function hash() that returns the standard computation of this value. It also supports the __hash__() method, for objects that wish to compute their own hash value.
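
For example (a sketch; the exact integers vary between interpreter runs, since Python randomizes string hashing, but equal values always hash equally within a run):

print(hash("hello") == hash("hello"))    # True: equal strings hash equally

class Record:
    def __init__(self, text, ID):
        self.text = text
        self.ID = ID

    def __hash__(self):
        # hash on the text field only, so records sharing a text hash the same
        return hash(self.text)

print(hash(Record("hello", 1)) == hash(Record("hello", 2)))    # True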

In a "normal" scenario, one way to determine if you have seen a field value before would be to use the field value as part of a dictionary. For example, you might store a dictionary that maps the field in question to the entire record, or to a list of records that all share the same field value.

In your case, your data is too big (according to you), so that would be a bad idea. Instead, you might try something like this:

seen_before = {}    # Empty dictionary to start with.

while ... something :
    info = read_next_record()   # You figure this out.
    fld = info.fields[some_field]  # The value you care about

    hv = hash(fld)     # Compute hash value for field.

    if hv in seen_before:
        print("This field value has been seen before")

    else:
        seen_before[hv] = True  # Never seen ... until NOW!
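
A related note: Python's dict and set already hash their keys internally, so the field value itself can be used as the key; that also avoids treating two different strings that merely share a hash value as duplicates. A minimal runnable sketch (the record layout here is hypothetical, just a list of dicts):

records = [{"text": "hello", "ID": 1},
           {"text": "hello", "ID": 2},
           {"text": "something else", "ID": 3}]

seen_before = set()
for info in records:
    fld = info["text"]            # the field we care about
    if fld in seen_before:        # the set hashes fld internally
        print("This field value has been seen before")
    else:
        seen_before.add(fld)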

Upvotes: 0

Stefan Pochmann

Reputation: 28596

Could use a dict with the text as key and the newest object for each key as value.

Setting up some demo data:

>>> from collections import namedtuple
>>> Object = namedtuple('Object', 'text ID')
>>> objects = Object('foo', 1), Object('foo', 2), Object('bar', 4), Object('bar', 3)

Solution:

>>> unique = {}
>>> for obj in objects:
        if obj.text not in unique or obj.ID > unique[obj.text].ID:
            unique[obj.text] = obj

>>> unique.values()
[Object(text='foo', ID=2), Object(text='bar', ID=4)]
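
An equivalent non-interactive variant (a sketch): sorting by ID first lets a plain dict comprehension keep the highest-ID object per text, since later assignments overwrite earlier ones; in Python 3, values() returns a view, so wrap it in list() if a list is needed:

from collections import namedtuple

Object = namedtuple('Object', 'text ID')
objects = [Object('foo', 1), Object('foo', 2), Object('bar', 4), Object('bar', 3)]

# later (higher-ID) objects overwrite earlier ones for the same text
unique = {obj.text: obj for obj in sorted(objects, key=lambda o: o.ID)}
print(list(unique.values()))
# [Object(text='foo', ID=2), Object(text='bar', ID=4)]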

Upvotes: 0
