Jimmy C

Reputation: 9680

Speed issues when converting list to dictionary

I'm having some speed issues regarding the conversion of lists to dictionaries, where the following operation takes up about 90% of the total running time:

def list2dict(list_):
    return_dict = {}

    for idx, word in enumerate(list_):
        if word in return_dict:
            raise ValueError("duplicate string found in list: %s" % (word))
        return_dict[word] = idx

    return return_dict

I'm having trouble seeing exactly what causes this. Are there any obvious bottlenecks that you see in the code, or suggestions on how to speed it up?

Thanks.

Upvotes: 4

Views: 156

Answers (5)

BWStearns

Reputation: 2706

This came in at an average of 2.36 µs per run. Because the OP's code looked like it was favoring the first occurrence of a value, I reversed the range so that it doesn't need any logic to check whether a key already exists.

def mkdct(list_):
    # walk the indices in reverse so that, for duplicate words, the first
    # occurrence is written last and therefore wins -- no membership check needed
    return {list_[x]: x for x in range(len(list_) - 1, -1, -1)}

Edit: there were some dumb mistakes there.

Upvotes: 0

U2EF1

Reputation: 13259

Using pandas seems to speed it up by about a third, though as has been said it's possibly already "fast enough".

> TEST = # english scrabble dictionary, 83882 entries
> import numpy as np
> import pandas as pd
> def mylist2dict(list_):
      # note: this returns a pandas Series (value -> position), not a plain dict
      return_dict = pd.Series(index=list_, data=np.arange(len(list_)))
      if not return_dict.index.is_unique:
          raise ValueError
      return return_dict
> %timeit list2dict(TEST)
10 loops, best of 3: 28.8 ms per loop
> %timeit mylist2dict(TEST)
100 loops, best of 3: 18.8 ms per loop

Upvotes: 0

MyGGaN

Reputation: 1806

The fastest function depends on the length of list_. adsmith's list2dictC() is pretty fast for small lists (fewer than ~80 items), but as the list grows I find my list2dictE() about 8% faster.

def list2dictC(list_):  # dict comp over enumerate(list)
    return_dict = {word: idx for idx, word in enumerate(list_)}
    if len(return_dict) == len(list_):
        return return_dict
    else:
        raise ValueError("Duplicate string found in list")

def list2dictE(list_):  # Faster for lists with ~80 items or more
    l = len(list_)
    return_dict = dict(zip(list_, range(l)))
    if len(return_dict) == l:
        return return_dict
    else:
        raise ValueError("Duplicate string found in list")

If the length is known to be small this is of no use, but if that's not the case you could add something like l = len(list_); if l < 80: ... else: ... (see the sketch below). It's just one additional if statement, since both functions need to know the list length anyway. The threshold of 80 items may vary depending on your setup, but it's in that ballpark for me on both Python 2.7 and 3.3.
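A minimal sketch of that dispatch, reusing the two function bodies above; treat the 80-item threshold as an assumption to re-measure on your own setup:

    def list2dict_combined(list_, threshold=80):
        # threshold is an assumed crossover point, not a universal constant
        l = len(list_)
        if l < threshold:
            return_dict = {word: idx for idx, word in enumerate(list_)}  # list2dictC path
        else:
            return_dict = dict(zip(list_, range(l)))  # list2dictE path
        if len(return_dict) == l:
            return return_dict
        raise ValueError("Duplicate string found in list")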

Upvotes: 1

Adam Smith

Reputation: 54213

EDIT:

Figured I'd put this up top since it's bigger -- turns out that a minor tweak to OP's code gives a pretty big bump in performance.

def list2dict(list_):    # OLD
    return_dict = {}
    for idx, word in enumerate(list_):
        if word in return_dict: # this compare is happening every iteration!
            raise ValueError("duplicate string found in list: %s" % (word))
        return_dict[word] = idx
    return return_dict

def list2dictNEW(list_): #NEW HOTNESS
    return_dict = {}
    for idx, word in enumerate(list_):
        return_dict[word] = idx # overwrite if you want to, because...
    if len(return_dict) == len(list_): return return_dict
    # if the lengths aren't the same, something got overwritten so we
    # won't return. If they ARE the same, toss it back with only one
    # compare (rather than n compares in the original)
    else: raise ValueError("There were duplicates in list {}".format(list_))

DEMO:
>>> timeit(lambda: list2dictNEW(TEST))
1.9117132451798682
>>> timeit(lambda: list2dict(TEST))
2.2543816669587216
# gains of a third of a second per million iterations!
# that's a 15.2% speed boost

No obvious answers, but you could try something like:

def list2dict(list_):
    return_dict = dict()
    for idx,word in enumerate(list_):
        return_dict.setdefault(word, idx)  # keeps the first index seen; duplicates are silently ignored rather than raising
    return return_dict

You could also build a set and use list_.index since you say the lists are fairly small, but I'm GUESSING that would be slower rather than faster. This would need profiling to be sure (use timeit.timeit).

def list2dict(list_):
    set_ = set(list_)
    return {word:list_.index(word) for word in set_}

I took the liberty of running some profiles on a set of test data. Here are the results:

TEST = ['a','b','c','d','e','f','g','h','i','j'] # 10 items

def list2dictA(list_): # build set and index word
    set_ = set(list_)
    return {word:list_.index(word) for word in set_}

def list2dictB(list_): # setdefault over enumerate(list)
    return_dict = dict()
    for idx,word in enumerate(list_):
        return_dict.setdefault(word,idx)
    return return_dict

def list2dictC(list_): # dict comp over enumerate(list)
    return_dict = {word:idx for idx,word in enumerate(list_)}
    if len(return_dict) == len(list_):
        return return_dict
    else:
        raise ValueError("Duplicate string found in list")

def list2dictD(list_): # Original example from Question
    return_dict = {}
    for idx, word in enumerate(list_):
        if word in return_dict:
            raise ValueError("duplicate string found in list: %s" % (word))
        return_dict[word] = idx
    return return_dict

>>> timeit(lambda: list2dictA(TEST))
5.336584700190931
>>> timeit(lambda: list2dictB(TEST))
2.7587691306531
>>> timeit(lambda: list2dictC(TEST))
2.1609074989233292
>>> timeit(lambda: list2dictD(TEST))
2.2543816669587216

Upvotes: 1

blakev

Reputation: 4459

def l2d(list_):
    # collect every word that appears more than once (list_.count makes this O(n**2))
    dupes = set(filter(lambda x: list_.count(x) > 1, list_))
    if len(dupes) > 0:
        raise ValueError('duplicates: %s' % (dupes))
    return dict((word, idx) for (idx, word) in enumerate(list_))

How does this compare?
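One way to check, assuming the TEST list and the OP's list2dict from the other answers are in scope (hypothetical harness, no numbers measured here):

    from timeit import timeit

    # compare this answer's l2d against the original list2dict on the same data
    print(timeit(lambda: l2d(TEST), number=1000))
    print(timeit(lambda: list2dict(TEST), number=1000))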

Upvotes: 0
