Reputation: 14684
I have a problem with two very large files (each with more than 1,000,000 entries) in Python: I need to generate a filter and I don't know how. I have two files like this:
1,2,3
2,4,5
3,3,4
and the second
1,"fege"
2,"greger"
4,"feffg"
The first item of each row is always the ID. Now I want to filter the lists so that the first list only contains rows whose IDs appear in the second file. For this example the result should be:
1,2,3
2,4,5
How can I do this in a very fast way? The core problem is that each list is very long. I used something like this:
[row for row in myRows if row[0] == item[0]]
but this takes a very long time to run through (more than 30 days).
Upvotes: 3
Views: 480
Reputation: 363607
[row for row in myRows if row[0] == item[0]] is doing a linear scan for each item. If you use a set instead, you can bring this down to an expected constant-time operation per lookup. First, read in the second file to get a set of valid ids:
with open("secondfile") as f:
# note: only storing the ids, not the whole line
valid_ids = set(ln.split(',', 1)[0] for ln in f)
Then you can filter the lines of the first file using the set valid_ids:
with open("firstfile") as f:
matched_rows = [ln for ln in f if ln.split(',')[0] in valid_ids]
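With inputs this size you may not even want the matches in a list. A minimal end-to-end sketch of the same approach that streams the first file and writes matching rows straight to disk (the file names "firstfile", "secondfile", and "filtered" are placeholders):

# build the set of ids from the second file, then stream the first
# file and write matching rows directly to an output file
with open("secondfile") as f:
    valid_ids = set(ln.split(',', 1)[0] for ln in f)
with open("firstfile") as src, open("filtered", "w") as dst:
    for ln in src:
        if ln.split(',', 1)[0] in valid_ids:
            dst.write(ln)  # each line already ends with '\n'

This keeps memory proportional to the number of ids rather than to the size of the first file.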
Upvotes: 7
Reputation: 10882
I assume you are only interested in the first field. If so, you could try something like:
def _id(s):
    # everything up to the first comma is the ID field
    return s[:s.index(',')]

ids = {}
with open('first-file') as f:
    for line in f:
        ids[_id(line)] = line

with open('second-file') as f:
    for line in f:
        k = _id(line)
        if k in ids:
            print(ids[k], end='')  # the stored line keeps its '\n'
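Note that this keeps every line of the first file in memory, while the set-based answer above stores only the ids. One caveat with both: splitting on the first comma would mis-parse an id field that is itself quoted and contains a comma. If that can happen in your data, here is a sketch of the same dict-based idea using the standard csv module (file names as in the answer; proper CSV quoting and no blank lines are assumptions, not part of the original answer):

import csv

# same dict-based idea, but csv.reader handles quoted fields
# (like the strings in the second file) correctly
with open('first-file', newline='') as f:
    ids = {row[0]: row for row in csv.reader(f)}
with open('second-file', newline='') as f:
    for row in csv.reader(f):
        if row[0] in ids:
            print(','.join(ids[row[0]]))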
Upvotes: 1