Reputation: 1426
I have a large list of objects in Python that I'm storing in a text file (for lack of knowledge of how to use any other database for the present).
Currently there are 40,000 but I expect the list length eventually may exceed 1,000,000. I'm trying to remove duplicates, where duplicates are defined as different objects having the same value for a text string attribute, but keep the most recent version of that object (defined as having the highest value in another attribute).
What I want to make is a function that returns only objects 2 and 3 from the following list, reliably:
Object 1: text="hello" ID=1
Object 2: text="hello" ID=2
Object 3: text="something else" ID=3
Doing this manually (looping through the list each time for each object) is too slow already and will get slower with O(l^2)
, so I need a smarter way to do it. I have seen hashing the objects and using the set function recommended multiple times, but I have two questions about this that I haven't found satisfactory answers to:
How does hashing improve the efficiency to the degree it does?
How can I do this and retain only the most recent such object? The examples I have seen all use the set function and I'm not sure how that would return only the most recent one.
EDIT: I can probably find good answers to question 1 elsewhere, but I am still stuck on question 2. To take another stab at explaining it, hashing the objects above on their text and using the set function will return a set where the objects chosen from duplicates are randomly chosen from each group of duplicates (e.g., above, either a set of (Object 2, Object 3) or (Object 1, Object 3) could be returned; I need (Object 2, Object 3)).
Upvotes: 0
Views: 144
Reputation: 113930
change to using a database ...
import sqlite3
db = sqlite3.connect("my.db")
db.execute("CREATE TABLE IF NOT EXISTS my_items (text PRIMARY KEY, id INTEGER);")
my_list_of_items = [("test",1),("test",2),("asdasd",3)]
db.execute_many("INSERT OR REPLACE INTO my_items (text,id) VALUES (?,?)",my_list_of_items)
db.commit()
print(db.execute("SELECT * FROM my_items").fetchall())
this may have maginally higher overhead in terms of time ... but you will save in RAM
Upvotes: 1
Reputation: 15290
Hashing is a well-researched subject in Computer Science. One of the standard uses is for implementing what Python calls a dictionary
. (Perl calls the same thing a hash
, for some reason. ;-) )
The idea is that for some key, such as a string, you can compute a simple numeric function - the hash value - and use that number as a quick way to look up the associated value stored in the dictionary.
Python has the built-in function hash()
that returns the standard computation of this value. It also supports the __hash__()
function, for objects that wish to compute their own hash value.
In a "normal" scenario, one way to determine if you have seen a field value before would be to use the field value as part of a dictionary. For example, you might stored a dictionary that maps the field in question to the entire record, or a list of records that all share the same field value.
In your case, your data is too big (according to you), so that would be a bad idea. Instead, you might try something like this:
seen_before = {} # Empty dictionary to start with.
while ... something :
info = read_next_record() # You figure this out.
fld = info.fields[some_field] # The value you care about
hv = hash(fld) # Compute hash value for field.
if hv in seen_before:
print("This field value has been seen before")
else:
seen_before[hv] = True # Never seen ... until NOW!
Upvotes: 0
Reputation: 28596
Could use a dict with the text as key and the newest object for each key as value.
Setting up some demo data:
>>> from collections import namedtuple
>>> Object = namedtuple('Object', 'text ID')
>>> objects = Object('foo', 1), Object('foo', 2), Object('bar', 4), Object('bar', 3)
Solution:
>>> unique = {}
>>> for obj in objects:
if obj.text not in unique or obj.ID > unique[obj.text].ID:
unique[obj.text] = obj
>>> unique.values()
[Object(text='foo', ID=2), Object(text='bar', ID=4)]
Upvotes: 0