gurehbgui

Reputation: 14684

How to make an efficient filter in Python

I have a problem with two very large files (each more than 1,000,000 entries) in Python: I need to generate a filter and I don't know how. I have two files like this:

1,2,3
2,4,5
3,3,4

and the second:

1,"fege"
2,"greger"
4,"feffg"

The first item of each row is always the ID. Now I want to filter the lists so that the first list only contains rows whose IDs also appear in the second file. For this example the result should be:

1,2,3
2,4,5

How can I do this in a very fast way? The core problem is that each list is very, very long. I used something like this:

[row for row in myRows if row[0] == item[0]]

but this takes a very long time to run through (more than 30 days).

Upvotes: 3

Views: 480

Answers (2)

Fred Foo

Reputation: 363607

[row for row in myRows if row[0] == item[0]]

is doing a linear scan for each item. If you use a set instead, you can bring this down to an expected constant-time operation. First, read in the second file to get a set of valid IDs:

with open("secondfile") as f:
    # note: only storing the ids, not the whole line
    valid_ids = set(ln.split(',', 1)[0] for ln in f)
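
(The maxsplit argument of 1 stops splitting after the first comma, so long rows are not fully parsed just to extract the ID.)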

Then you can filter the lines of the first file using the set valid_ids as follows:

with open("firstfile") as f:
    matched_rows = [ln for ln in f if ln.split(',', 1)[0] in valid_ids]
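
If even the matched rows are too many to hold in a list, you can stream them straight to an output file instead; a minimal sketch, reusing valid_ids from above and assuming a hypothetical output file name "filteredfile":

    with open("firstfile") as src, open("filteredfile", "w") as dst:
        for ln in src:
            # split at most once; only the ID field is needed
            if ln.split(',', 1)[0] in valid_ids:
                dst.write(ln)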

Upvotes: 7

amit

Reputation: 10882

I assume you are only interested in the first field. If so, you could try something like:

def _id(s):
    # everything up to the first comma is the ID
    return s[:s.index(',')]

# map each ID from the first file to its full line
ids = {}
with open('first-file') as f:
    for line in f:
        ids[_id(line)] = line

# print every first-file line whose ID also occurs in the second file
with open('second-file') as f:
    for line in f:
        k = _id(line)
        if k in ids:
            print(ids[k], end='')  # the line already ends with a newline
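
On the sample data this prints 1,2,3 and 2,4,5, in the order the IDs appear in the second file. Note that it keeps every full line of the first file in memory; if you only need the IDs for membership testing, a plain set (as in the other answer) is lighter.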

Upvotes: 1
