Reputation: 461
I have to loop over a large file with 2 million lines, that looks like this
P61981 1433G_HUMAN
P61982 1433G_MOUSE
Q5RC20 1433G_PONAB
P61983 1433G_RAT
P68253 1433G_SHEEP
Currently I have the following function: it takes every entry in the list, and if the entry occurs in this large file, it keeps the row with the occurrence. But it's slow (~10 min), probably due to the looping scheme. Can you please suggest an optimization?
up = "database.txt"
def mplist(somelist):
    newlist = []
    with open(up) as U:
        for row in U:
            for i in somelist:
                if i in row:
                    newlist.append(row)
    return newlist
Example of the somelist:
somelist = [
    'P68250',
    'P31946',
    'Q4R572',
    'Q9CQV8',
    'A4K2U9',
    'P35213',
    'P68251'
]
Upvotes: 0
Views: 130
Reputation: 1121914
If your somelist only contains values found in the first column, then split the line and only test the first value against a set, not a list:
def mplist(somelist):
    someset = set(somelist)
    with open(up) as U:
        return [line for line in U if line.split(None, 1)[0] in someset]
Testing against a set is an O(1) constant-time operation (independent of the size of the set), while testing against a list is O(n) per row.
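To illustrate the difference, here is a rough micro-benchmark sketch (not from the original answer; the key values are made up and the absolute timings vary by machine):

import timeit

# Build 10,000 fake accession keys; test membership of one near the end.
setup = "keys = ['P%05d' % i for i in range(10000)]; as_list = keys; as_set = set(keys)"
print(timeit.timeit("'P09999' in as_list", setup=setup, number=1000))  # linear scan
print(timeit.timeit("'P09999' in as_set", setup=setup, number=1000))   # hash lookup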
Demo:
>>> up = '/tmp/database.txt'
>>> open(up, 'w').write('''\
... P61981 1433G_HUMAN
... P61982 1433G_MOUSE
... Q5RC20 1433G_PONAB
... P61983 1433G_RAT
... P68253 1433G_SHEEP
... ''')
>>> def mplist(somelist):
...     someset = set(somelist)
...     with open(up) as U:
...         return [line for line in U if line.split(None, 1)[0] in someset]
...
>>> mplist(['P61981', 'Q5RC20'])
['P61981 1433G_HUMAN\n', 'Q5RC20 1433G_PONAB\n']
You may want to make this a generator function instead, so it only filters and never builds the full list in memory:
def mplist(somelist):
    someset = set(somelist)
    with open(up) as U:
        for line in U:
            if line.split(None, 1)[0] in someset:
                yield line  # the file stays open until the generator is exhausted
You can loop over this result, but not index it:
for match in mplist(somelist):
    # do something with match
and you never need to hold all matched entries in memory.
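For example, a minimal usage sketch (matches.txt is just a hypothetical output path) that streams the matches straight to disk:

with open('matches.txt', 'w') as out:
    for match in mplist(somelist):
        out.write(match)  # each matched line keeps its trailing newline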
Upvotes: 6