Reputation: 4477
I have a very large file that I'm parsing, extracting a key and a value from each line. I want to keep only the first key for each distinct value; in other words, I'm removing the duplicate values.
So it would look like:
{
A:1
B:2
C:3
D:2
E:2
F:3
G:1
}
and it would output:
{E:2,F:3,G:1}
It's a bit confusing because I don't really care what the key is. So E in the above could be replaced with B or D, F could be replaced with C, and G could be replaced with A.
Here is the best way I have found to do it but it is extremely slow as the file gets larger.
mapp = {}
value_holder = []
for i in mydict:
    if mydict[i] not in value_holder:
        mapp[i] = mydict[i]
        value_holder.append(mydict[i])
I must look through value_holder every time :( Is there a faster way to do this?
Upvotes: 4
Views: 5691
Reputation: 365747
Yes, a trivial change makes it much faster:
value_holder = set()
(Well, you also have to change the append to add. But still pretty simple.)
Using a set instead of a list means each lookup is O(1) instead of O(N), so the whole operation is O(N) instead of O(N^2). In other words, if you have 10,000 lines, you're doing 10,000 hash lookups instead of 50,000,000 comparisons.
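To make the difference concrete, here's a quick, illustrative timing sketch (the sizes and repeat counts are arbitrary; absolute numbers will vary by machine):

```python
# Compare worst-case membership tests: a probe value that is absent,
# so the list must scan every element while the set does one hash lookup.
import timeit

values = list(range(10_000))
as_list = values
as_set = set(values)

list_time = timeit.timeit(lambda: 10_000 in as_list, number=1000)
set_time = timeit.timeit(lambda: 10_000 in as_set, number=1000)

print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
# The set lookups should come out orders of magnitude faster.
```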
One caveat with this solution—and all of the others posted—is that it requires the values to be hashable. If they're not hashable, but they are comparable, you can still get O(N log N) instead of O(N^2) by using a sorted set (e.g., from the blist library). If they're neither hashable nor sortable… well, you'll probably want to find some way to generate something hashable (or sortable) to use as a "first check", and then only walk the "first check" matches for actual matches, which will get you to O(NM), where M is the average number of hash collisions.
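As a sketch of the comparable-but-unhashable fallback, the stdlib bisect module can stand in for a sorted set (note that inserting into a sorted list still shifts elements, so a real sorted-set type such as the blist one mentioned above is preferable for very large inputs):

```python
import bisect

def first_by_value_sorted(pairs):
    # Keep the first key for each distinct value, using only comparisons
    # (binary search on a sorted list) rather than hashing.
    seen = []    # sorted list of values seen so far
    result = {}
    for k, v in pairs:
        i = bisect.bisect_left(seen, v)
        if i < len(seen) and seen[i] == v:
            continue          # value already kept under an earlier key
        seen.insert(i, v)     # insert at the position bisect found
        result[k] = v
    return result

# Works even for unhashable (but comparable) values such as lists:
print(first_by_value_sorted([("A", [1]), ("B", [2]), ("D", [2])]))
# {'A': [1], 'B': [2]}
```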
You might want to look at how unique_everseen is implemented in the itertools recipes in the standard library documentation.
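A paraphrase of that recipe (not the verbatim standard-library text), with a key function so it can dedupe (key, value) pairs by value:

```python
def unique_everseen(iterable, key=None):
    # Yield elements whose key hasn't been seen yet, remembering
    # every key ever seen.
    seen = set()
    for element in iterable:
        k = element if key is None else key(element)
        if k not in seen:
            seen.add(k)
            yield element

pairs = [("A", 1), ("B", 2), ("D", 2), ("G", 1)]
print(dict(unique_everseen(pairs, key=lambda kv: kv[1])))
# {'A': 1, 'B': 2}
```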
Note that dictionaries don't actually have an order, so there's no way to pick the "first" duplicate; you'll just get one arbitrarily. In which case, there's another way to do this:
inverted = {v:k for k, v in d.iteritems()}
reverted = {v:k for k, v in inverted.iteritems()}
(This is effectively a form of the decorate-process-undecorate idiom without any processing.)
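A tiny demo of the double inversion, written with Python 3's .items() in place of the Python 2 .iteritems() above:

```python
d = {"A": 1, "B": 2, "D": 2, "G": 1}
inverted = {v: k for k, v in d.items()}    # per value, the last key wins
reverted = {v: k for k, v in inverted.items()}
print(reverted)
# One arbitrary key survives per distinct value; with CPython's
# insertion-ordered dicts this happens to be {'G': 1, 'D': 2}.
```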
But instead of building up the dict and then filtering it, you can make things better (simpler, and faster, and more memory-efficient, and order-preserving) by filtering as you read. Basically, keep the set alongside the dict as you go along. For example, instead of this:
mydict = {}
for line in f:
    k, v = line.split(None, 1)
    mydict[k] = v

mapp = {}
value_holder = set()
for i in mydict:
    if mydict[i] not in value_holder:
        mapp[i] = mydict[i]
        value_holder.add(mydict[i])
Just do this:
mapp = {}
value_holder = set()
for line in f:
    k, v = line.split(None, 1)
    if v not in value_holder:
        mapp[k] = v
        value_holder.add(v)
In fact, you may want to consider writing a one_to_one_dict that wraps this up (or search PyPI modules and ActiveState recipes to see if someone has already written it for you), so then you can just write:
mapp = one_to_one_dict()
for line in f:
    k, v = line.split(None, 1)
    mapp[k] = v
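A minimal sketch of what such a wrapper might look like (the name OneToOneDict and its ignore-duplicate-values behavior are assumptions here, not an existing PyPI package):

```python
class OneToOneDict(dict):
    # A dict that silently drops an assignment whose value is already
    # stored under some earlier key. Only plain d[k] = v assignments
    # are intercepted; update() etc. would need overriding too.
    def __init__(self):
        super().__init__()
        self._values_seen = set()

    def __setitem__(self, key, value):
        if value in self._values_seen:
            return
        self._values_seen.add(value)
        super().__setitem__(key, value)

mapp = OneToOneDict()
for k, v in [("A", 1), ("B", 2), ("D", 2), ("G", 1)]:
    mapp[k] = v
print(dict(mapp))  # {'A': 1, 'B': 2}
```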
Upvotes: 7
Reputation: 89017
The first way to speed this up, as others have mentioned, is using a set to record seen values, as checking for membership in a set is much faster.
We can also make this a lot shorter with a dict comprehension:
seen = set()
new_mapp = {k: v for k, v in mapp.items() if v not in seen and not seen.add(v)}
The if clause requires a little explanation: we only keep key/value pairs where we haven't seen the value before, and we use and not seen.add(v) a little bit hackishly to record each newly seen value as a side effect. As set.add() returns None, not seen.add(v) is always true, so it will not affect the outcome.
As always, in 2.x, use dict.iteritems() over dict.items().
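A worked example of the comprehension, using and not seen.add(v) so each newly kept value is recorded as a side effect (Python 3 shown, where dicts preserve insertion order):

```python
mapp = {"A": 1, "B": 2, "C": 3, "D": 2}
seen = set()
# set.add() returns None, so "not seen.add(v)" is always True and merely
# records v; the and-chain only reaches it for values not yet in seen.
new_mapp = {k: v for k, v in mapp.items() if v not in seen and not seen.add(v)}
print(new_mapp)  # {'A': 1, 'B': 2, 'C': 3}
```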
Upvotes: 2
Reputation: 1202
Part of your problem is that dicts do not preserve any sort of logical ordering when they are iterated through. They use hash tables to index items (see this great article). So there's no real concept of "first occurrence of a value" in this sort of data structure. The right way to do this would probably be a list of key-value pairs, e.g.:
kv_pairs = [(k1,v1),(k2,v2),...]
or, because the file is so large, it would be better to use the excellent file iteration Python provides to retrieve the k/v pairs:
def kv_iter(f):
    # f being the file descriptor
    for line in f:
        yield ...  # (whatever logic you use to get k, v values from a line)
value_holder is a great candidate for a set. You are really just testing membership in value_holder. Because values are unique, they can be indexed more efficiently using a similar hashing method. So it would end up a bit like this:
mapp = {}
value_holder = set()
for k, v in kv_iter(f):
    if v not in value_holder:
        mapp[k] = v
        value_holder.add(v)
Upvotes: -1
Reputation: 76715
You said you are reading from a very large file and want to keep only the first key seen for each value. I originally assumed this meant you care about the order in which the key/value pairs occur in the very large file. This code will do that and will be fast.
values_seen = set()
mapp = {}

with open("large_file.txt") as f:
    for line in f:
        key, value = line.split()
        if value not in values_seen:
            values_seen.add(value)
            mapp[key] = value
You were using a list to keep track of what values your code had seen. Searching through a list is very slow: it gets slower the larger the list gets. A set is much faster because lookups are close to constant time (they don't get much slower, or maybe not at all slower, the larger the set gets). (A dict also works the way a set works.)
Upvotes: -1
Reputation: 47988
Using a set instead of a list would speed you up considerably ...
Upvotes: 0
Reputation: 17052
I'm not completely clear on exactly what you're doing, but set is a great way to remove duplicates. For example:
>>> k = [1,3,4,4,5,4,3,2,2,3,3,4,5]
>>> set(k)
set([1, 2, 3, 4, 5])
>>> list(set(k))
[1, 2, 3, 4, 5]
Though it depends a bit on the structure of the input that you're loading, there might be a way to simply use set so that you don't have to iterate through the entire object every time to see if there are any matching keys; instead, run it through set once.
Upvotes: 2