Penguin

Reputation: 2411

How to efficiently find a dictionary value based on another value in a list of dictionaries

I have a very large (~100k) list of dictionaries:

[{'sequence': 'read the rest of this note', 'score': 0.22612378001213074, 'token': 3805, 'token_str': 'note'}, {'sequence': 'read the rest of this page', 'score': 0.11293990164995193, 'token': 3674, 'token_str': 'page'}, {'sequence': 'read the rest of this week', 'score': 0.06504543870687485, 'token': 1989, 'token_str': 'week'}]

Given a token ID (e.g. 1989), how can I find the corresponding score efficiently? I have to do this multiple times for each list (I have several of these large lists, and for each one I have several token IDs).

I'm currently iterating through each dictionary in the list and checking whether its token ID matches my input ID; if it does, I take the score. But this is quite slow.

Upvotes: 1

Views: 1592

Answers (3)

Pedro Maia

Reputation: 2722

Since you have to search multiple times, consider building a single dictionary with the token as the key:

a = [{'sequence': 'read the rest of this note', 'score': 0.22612378001213074, 'token': 3805, 'token_str': 'note'}, {'sequence': 'read the rest of this page', 'score': 0.11293990164995193, 'token': 3674, 'token_str': 'page'}, {'sequence': 'read the rest of this week', 'score': 0.06504543870687485, 'token': 1989, 'token_str': 'week'}]

my_dict = {i['token']: i for i in a}

Building the dict takes O(n) time once, but every lookup after that is O(1) on average.

This might seem wasteful in terms of memory, but Python does not copy the dictionaries: the new dict only holds references to the dict objects that already exist in the list. You can confirm that using:

>>> a[0] is my_dict[3805]
True

So you can think of it as creating an alias for each element in the list.
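As a rough sketch of how the lookups could then look: the block below reuses `a` and `my_dict` from above, while the token ID 99999 is a made-up value used only to show what happens for an ID that is not in the list.

```python
a = [
    {'sequence': 'read the rest of this note', 'score': 0.22612378001213074, 'token': 3805, 'token_str': 'note'},
    {'sequence': 'read the rest of this page', 'score': 0.11293990164995193, 'token': 3674, 'token_str': 'page'},
    {'sequence': 'read the rest of this week', 'score': 0.06504543870687485, 'token': 1989, 'token_str': 'week'},
]

# Build the index once per list: a single O(n) pass.
my_dict = {i['token']: i for i in a}

# Each lookup afterwards is a hash lookup instead of a scan of the list.
score = my_dict[1989]['score']

# .get returns None instead of raising KeyError for an unknown token ID.
# 99999 is a hypothetical ID that does not occur in the sample data.
missing = my_dict.get(99999)
missing_score = missing['score'] if missing is not None else None

print(score, missing_score)
```

Note that if the same token appears more than once in the list, the dict comprehension keeps only the last entry for it.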

Upvotes: 5

Booboo

Reputation: 44108

If your list of dictionaries is:

l = [{'sequence': 'read the rest of this note', 'score': 0.22612378001213074, 'token': 3805, 'token_str': 'note'}, {'sequence': 'read the rest of this page', 'score': 0.11293990164995193, 'token': 3674, 'token_str': 'page'}, {'sequence': 'read the rest of this week', 'score': 0.06504543870687485, 'token': 1989, 'token_str': 'week'}]

And the values of token you are interested in are, for example:

token_values = [1989, 30897, 98762]

Then build a dictionary as follows:

d = {the_dict['token']: the_dict['score']
     for the_dict in l if the_dict['token'] in token_values}

This will build a minimal dictionary containing just the tokens you are interested in, mapped to their corresponding scores.
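A runnable sketch of this approach follows. It makes one change that is my assumption rather than part of the steps above: `token_values` is stored as a set, so each membership test is a hash lookup instead of a scan of a list.

```python
l = [
    {'sequence': 'read the rest of this note', 'score': 0.22612378001213074, 'token': 3805, 'token_str': 'note'},
    {'sequence': 'read the rest of this page', 'score': 0.11293990164995193, 'token': 3674, 'token_str': 'page'},
    {'sequence': 'read the rest of this week', 'score': 0.06504543870687485, 'token': 1989, 'token_str': 'week'},
]

# Assumption: a set instead of a list, so `in` does not scan token_values.
token_values = {1989, 30897, 98762}

# One pass over the list, keeping only the tokens of interest.
d = {the_dict['token']: the_dict['score']
     for the_dict in l if the_dict['token'] in token_values}

# Tokens that never occur in the list are simply absent from d,
# so .get is used here to avoid a KeyError.
for token in sorted(token_values):
    print(token, d.get(token))
```

With the three sample entries, only 1989 ends up in `d`; the other two IDs are not in the list.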

Upvotes: 0

norbot

Reputation: 267

Using pandas might be more efficient for large datasets.

An example of finding the score for token 3805:

import pandas as pd

source_list = [{'sequence': 'read the rest of this note', 'score': 0.22612378001213074, 'token': 3805, 'token_str': 'note'}, {'sequence': 'read the rest of this page', 'score': 0.11293990164995193, 'token': 3674, 'token_str': 'page'}, {'sequence': 'read the rest of this week', 'score': 0.06504543870687485, 'token': 1989, 'token_str': 'week'}]

df = pd.DataFrame(source_list)
result = df[df.token == 3805]

print(result.score.values[0])

Upvotes: 0
