Chris
Chris

Reputation: 23

How to inner-join a list of tuples with a list?

I have a lists of tuples (>150k elements) containing an ID with the length of that ID. However, this list should only show IDs that appear in the second list (>80k).

list_of_tuples = [('1', 31.46), ('10', 97.99), ('50', 71.19), ('100', 17.03), ...]
normal_list = ['1', '50', '100', ...]

The desired output is:

list_of_tuples = [('1', 31.46), ('50', 71.19), ('100', 17.03), ...]

Here's the code that I threw together for testing the concept, but as I am new to Python, it doesn't work. I also haven't found a solution online for this kind of issue.

    for whole_elem in list_of_tuples:
        for first_elem in whole_elem:
            for link in normal_list:
                if first_elem <> link
                list_of_tuples.pop(whole_elem)

I would appreciate your support a lot. Thank you very much!

Upvotes: 1

Views: 696

Answers (3)

Mark Reed
Mark Reed

Reputation: 95242

Does a list comprehension not work here?

filtered_tuples = [tpl for tpl in list_of_tuples if tpl[0] in normal_list]

Since your list is so big, you probably want to turn the normal_list into a set to improve performance:

normal_set = frozenset(normal_list)
filtered_tuples = [tpl for tpl in list_of_tuples if tpl[0] in normal_set]

Upvotes: 3

sophros
sophros

Reputation: 16660

You can probably solve this conceptually at the same level as asked ("inner join") with pandas join function (look at this question).

However, here I would just do the following:

result = []
normal_set = set(normal_list)  # improves performance of 'contains' check from len(normal_list)/2 (on avarage) to log(len(normal_list)
for tpl in list_of_tuples:
   if tpl[0] in normal_set:
      result.append(tpl)

On top of that, you can yield elements of the result instead of append if you do not need to consume the result as a whole:

def inner_join(normal_list, list_of_tuples):
    result = []
    normal_set = set(normal_list)
    for tpl in list_of_tuples:
       if tpl[0] in normal_set:
          yield tpl

Which you use as:

for el in innner_join(normal_list, list_of_tuples):
    ....

Upvotes: 1

Ry-
Ry-

Reputation: 224886

for first_elem in whole_elem:

The first element in whole_elem is whole_elem[0]. for first_elem in whole_elem would be a loop over all of its elements.

first_elem = whole_elem[0]
for link in normal_list:
    if first_elem <> link

This runs the body of the if every time there’s an element in normal_list that isn’t equal to first_elem. Instead, you want to check whether normal_list contains first_elem.

if first_elem not in normal_list:
dict_link_length.pop(whole_elem)

Not sure what dict_link_length is. Producing a new list is probably a better idea than modifying the old one, though.

filtered = []

for whole_elem in list_of_tuples:
    first_elem = whole_elem[0]

    if first_elem in normal_list:
        filtered.append(whole_elem)

Now pick more descriptive names for things (a list of tuples is a normal list). Here’s a start:

select_ids = ['1', '50', '100', ...]
filtered = []

for item in items:
    item_id = item[0]

    if item_id in select_ids:
        filtered.append(item)

Checking if an element is in a list with in is slow because it has to check the entire list one at a time, but checking if an element is in a set is constant-time:

select_ids = ['1', '50', '100', ...]
filtered = []

select_ids = frozenset(select_ids)

for item in items:
    item_id = item[0]

    if item_id in select_ids:
        filtered.append(item)

Finally, this can be written as a list comprehension:

select_ids = frozenset(select_ids)
filtered = [item for item in items if item[0] in select_ids]

Upvotes: 2

Related Questions