Reputation: 14684
I have a problem with two very large files (each with more than 1,000,000 entries) in Python: I need to generate a filter and I don't know how. I have two files like this:
1,2,3
2,4,5
3,3,4
and the second
1,"fege"
2,"greger"
4,"feffg"
The first item of each row is always the ID. Now I want to filter the lists so that the first list only contains rows whose IDs appear in the second file. For this example the result should be:
1,2,3
2,4,5
How can I do this in a very fast way? The core problem is that each list is very long. I used something like this:
[row for row in myRows if row[0] == item[0]]
but this takes a very long time to run through (more than 30 days).
Upvotes: 3
Views: 480
Reputation: 363607
[row for row in myRows if row[0] == item[0]] is doing a linear scan for each item. If you use a set instead, you can bring this down to an expected constant-time operation per lookup. First, read in the second file to get a set of valid ids:
with open("secondfile") as f:
# note: only storing the ids, not the whole line
valid_ids = set(ln.split(',', 1)[0] for ln in f)
Then you can filter the lines of the first file using the set valid_ids:
with open("firstfile") as f:
matched_rows = [ln for ln in f if ln.split(',')[0] in valid_ids]
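With inputs this size you may not even want the matches in a list. A minimal end-to-end sketch of the same approach that streams the first file and writes matching rows straight to disk (the file names "firstfile", "secondfile", and "filtered" are placeholders):

# build the set of ids from the second file, then stream the first
# file and write matching rows directly to an output file
with open("secondfile") as f:
    valid_ids = set(ln.split(',', 1)[0] for ln in f)
with open("firstfile") as src, open("filtered", "w") as dst:
    for ln in src:
        if ln.split(',', 1)[0] in valid_ids:
            dst.write(ln)  # each line already ends with '\n'

This keeps memory proportional to the number of ids rather than to the size of the first file.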
Upvotes: 7
Reputation: 10882
I assume you are only interested in the first field. If so, you could try something like:
def _id(s):
    # everything up to the first comma is the ID field
    return s[:s.index(',')]

ids = {}
with open('first-file') as f:
    for line in f:
        ids[_id(line)] = line

with open('second-file') as f:
    for line in f:
        k = _id(line)
        if k in ids:
            print(ids[k], end='')  # the stored line keeps its '\n'
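Note that this keeps every line of the first file in memory, while the set-based answer above stores only the ids. One caveat with both: splitting on the first comma would mis-parse an id field that is itself quoted and contains a comma. If that can happen in your data, here is a sketch of the same dict-based idea using the standard csv module (file names as in the answer; proper CSV quoting and no blank lines are assumptions, not part of the original answer):

import csv

# same dict-based idea, but csv.reader handles quoted fields
# (like the strings in the second file) correctly
with open('first-file', newline='') as f:
    ids = {row[0]: row for row in csv.reader(f)}
with open('second-file', newline='') as f:
    for row in csv.reader(f):
        if row[0] in ids:
            print(','.join(ids[row[0]]))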
Upvotes: 1