Reputation: 4477
I have a very large file that I'm parsing, extracting a key and a value from each line. I want to keep only the first key for each distinct value; in other words, I'm removing the duplicate values.
So it would look like:
{
A:1
B:2
C:3
D:2
E:2
F:3
G:1
}
and it would output:
{E:2,F:3,G:1}
It's a bit confusing because I don't really care what the key is. So E in the above could be replaced with B or D, F could be replaced with C, and G could be replaced with A.
Here is the best way I have found to do it but it is extremely slow as the file gets larger.
mapp = {}
value_holder = []
for i in mydict:
    if mydict[i] not in value_holder:
        mapp[i] = mydict[i]
        value_holder.append(mydict[i])
I must look through value_holder every time :( Is there a faster way to do this?
Upvotes: 4
Views: 5691
Reputation: 365747
Yes, a trivial change makes it much faster:
value_holder = set()
(Well, you also have to change the append to add. But still pretty simple.)
Using a set instead of a list means each lookup is O(1) instead of O(N), so the whole operation is O(N) instead of O(N^2). In other words, if you have 10,000 lines, you're doing 10,000 hash lookups instead of 50,000,000 comparisons.
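To make the difference concrete, here's a quick, illustrative timing sketch (the sizes and repeat counts are arbitrary; absolute numbers will vary by machine):

```python
# Compare worst-case membership tests: a probe value that is absent,
# so the list must scan every element while the set does one hash lookup.
import timeit

values = list(range(10_000))
as_list = values
as_set = set(values)

list_time = timeit.timeit(lambda: 10_000 in as_list, number=1000)
set_time = timeit.timeit(lambda: 10_000 in as_set, number=1000)

print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
# The set lookups should come out orders of magnitude faster.
```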
One caveat with this solution—and all of the others posted—is that it requires the values to be hashable. If they're not hashable, but they are comparable, you can still get O(N log N) instead of O(N^2) by using a sorted set (e.g., from the blist library). If they're neither hashable nor sortable… well, you'll probably want to find some way to generate something hashable (or sortable) to use as a "first check", and then only walk the "first check" matches for actual matches, which will get you to O(NM), where M is the average number of hash collisions.
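As a sketch of the comparable-but-unhashable fallback, the stdlib bisect module can stand in for a sorted set (note that inserting into a sorted list still shifts elements, so a real sorted-set type such as the blist one mentioned above is preferable for very large inputs):

```python
import bisect

def first_by_value_sorted(pairs):
    # Keep the first key for each distinct value, using only comparisons
    # (binary search on a sorted list) rather than hashing.
    seen = []    # sorted list of values seen so far
    result = {}
    for k, v in pairs:
        i = bisect.bisect_left(seen, v)
        if i < len(seen) and seen[i] == v:
            continue          # value already kept under an earlier key
        seen.insert(i, v)     # insert at the position bisect found
        result[k] = v
    return result

# Works even for unhashable (but comparable) values such as lists:
print(first_by_value_sorted([("A", [1]), ("B", [2]), ("D", [2])]))
# {'A': [1], 'B': [2]}
```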
You might want to look at how unique_everseen is implemented in the itertools recipes in the standard library documentation.
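A paraphrase of that recipe (not the verbatim standard-library text), with a key function so it can dedupe (key, value) pairs by value:

```python
def unique_everseen(iterable, key=None):
    # Yield elements whose key hasn't been seen yet, remembering
    # every key ever seen.
    seen = set()
    for element in iterable:
        k = element if key is None else key(element)
        if k not in seen:
            seen.add(k)
            yield element

pairs = [("A", 1), ("B", 2), ("D", 2), ("G", 1)]
print(dict(unique_everseen(pairs, key=lambda kv: kv[1])))
# {'A': 1, 'B': 2}
```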
Note that dictionaries don't actually have an order, so there's no way to pick the "first" duplicate; you'll just get one arbitrarily. In which case, there's another way to do this:
inverted = {v:k for k, v in d.iteritems()}
reverted = {v:k for k, v in inverted.iteritems()}
(This is effectively a form of the decorate-process-undecorate idiom without any processing.)
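A tiny demo of the double inversion, written with Python 3's .items() in place of the Python 2 .iteritems() above:

```python
d = {"A": 1, "B": 2, "D": 2, "G": 1}
inverted = {v: k for k, v in d.items()}    # per value, the last key wins
reverted = {v: k for k, v in inverted.items()}
print(reverted)
# One arbitrary key survives per distinct value; with CPython's
# insertion-ordered dicts this happens to be {'G': 1, 'D': 2}.
```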
But instead of building up the dict and then filtering it, you can make things better (simpler, and faster, and more memory-efficient, and order-preserving) by filtering as you read. Basically, keep the set alongside the dict as you go along. For example, instead of this:
mydict = {}
for line in f:
    k, v = line.split(None, 1)
    mydict[k] = v

mapp = {}
value_holder = set()
for i in mydict:
    if mydict[i] not in value_holder:
        mapp[i] = mydict[i]
        value_holder.add(mydict[i])
Just do this:
mapp = {}
value_holder = set()
for line in f:
    k, v = line.split(None, 1)
    if v not in value_holder:
        mapp[k] = v
        value_holder.add(v)
In fact, you may want to consider writing a one_to_one_dict that wraps this up (or search PyPI modules and ActiveState recipes to see if someone has already written it for you), so then you can just write:
mapp = one_to_one_dict()
for line in f:
    k, v = line.split(None, 1)
    mapp[k] = v
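A minimal sketch of what such a wrapper might look like (the name OneToOneDict and its ignore-duplicate-values behavior are assumptions here, not an existing PyPI package):

```python
class OneToOneDict(dict):
    # A dict that silently drops an assignment whose value is already
    # stored under some earlier key. Only plain d[k] = v assignments
    # are intercepted; update() etc. would need overriding too.
    def __init__(self):
        super().__init__()
        self._values_seen = set()

    def __setitem__(self, key, value):
        if value in self._values_seen:
            return
        self._values_seen.add(value)
        super().__setitem__(key, value)

mapp = OneToOneDict()
for k, v in [("A", 1), ("B", 2), ("D", 2), ("G", 1)]:
    mapp[k] = v
print(dict(mapp))  # {'A': 1, 'B': 2}
```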
Upvotes: 7
Reputation: 89017
The first way to speed this up, as others have mentioned, is using a set to record seen values, as checking for membership in a set is much faster.
We can also make this a lot shorter with a dict comprehension:
seen = set()
new_mapp = {k: v for k, v in mapp.items() if v not in seen and not seen.add(v)}
The if clause requires a little explanation: we only keep key/value pairs where we haven't seen the value before, and we use and not seen.add(v) a little bit hackishly to record each newly seen value as a side effect. As set.add() returns None, not seen.add(v) is always true, so it will not affect the outcome.
As always, in 2.x, use dict.iteritems() over dict.items().
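A worked example of the comprehension, using and not seen.add(v) so each newly kept value is recorded as a side effect (Python 3 shown, where dicts preserve insertion order):

```python
mapp = {"A": 1, "B": 2, "C": 3, "D": 2}
seen = set()
# set.add() returns None, so "not seen.add(v)" is always True and merely
# records v; the and-chain only reaches it for values not yet in seen.
new_mapp = {k: v for k, v in mapp.items() if v not in seen and not seen.add(v)}
print(new_mapp)  # {'A': 1, 'B': 2, 'C': 3}
```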
Upvotes: 2
Reputation: 1202
Part of your problem is that dicts do not preserve any sort of logical ordering when they are iterated through. They use hash tables to index items (see this great article). So there's no real concept of "first occurrence of a value" in this sort of data structure. The right way to do this would probably be a list of key-value pairs, e.g.:
kv_pairs = [(k1,v1),(k2,v2),...]
or, because the file is so large, it would be better to use the excellent file iteration Python provides to retrieve the k/v pairs:
def kv_iter(f):
    # f being the file descriptor
    for line in f:
        yield ...  # (whatever logic you use to get k, v values from a line)
value_holder is a great candidate for a set. You are really just testing membership in value_holder. Because values are unique, they can be indexed more efficiently using a similar hashing method. So it would end up a bit like this:
mapp = {}
value_holder = set()
for k, v in kv_iter(f):
    if v not in value_holder:
        mapp[k] = v
        value_holder.add(v)
Upvotes: -1
Reputation: 76715
You said you are reading from a very large file and want to keep only the first key seen for each value. I originally assumed this meant you care about the order in which the key/value pairs occur in the very large file. This code will do that and will be fast.
values_seen = set()
mapp = {}

with open("large_file.txt") as f:
    for line in f:
        key, value = line.split()
        if value not in values_seen:
            values_seen.add(value)
            mapp[key] = value
You were using a list to keep track of what values your code had seen. Searching through a list is very slow: it gets slower the larger the list gets. A set is much faster because lookups are close to constant time (they don't get much slower, or maybe not at all slower, the larger the set gets). (A dict also works the way a set works.)
Upvotes: -1
Reputation: 47988
Using a set instead of a list would speed you up considerably ...
Upvotes: 0
Reputation: 17052
I'm not completely clear on exactly what you're doing, but set is a great way to remove duplicates. For example:
>>> k = [1,3,4,4,5,4,3,2,2,3,3,4,5]
>>> set(k)
set([1, 2, 3, 4, 5])
>>> list(set(k))
[1, 2, 3, 4, 5]
Though it depends a bit on the structure of the input that you're loading, there might be a way to simply use set so that you don't have to iterate through the entire object every time to see if there are any matching keys; instead, run it through set once.
Upvotes: 2