Reputation: 79
I have two large text files:
’letters’ with > 600 000 lines
’numbers’ with > 100 000 lines
letters looks like this:
AAAA
AAAB
AAAC
etc…
numbers has two columns, one ’id’ and one with a list of numbers:
id1 5, 201, 66, 33
id2 356
id3 5103, 2, 452
etc…
I want each row in ’letters’ to represent a number:
1 AAAA
2 AAAB
etc…
and then check which row in ’numbers’ contains that number, and pair that row's id with the combination of letters. In this case, the only pair is:
AAAB id3
This script takes days to run:
combine = {}
for i, x in enumerate(letters):
    for id, number in numbers.items():
        if i+1 in number:
            combine[x['letter']] = id
Is there a faster way to do it?
Upvotes: 0
Views: 162
Reputation: 414315
1e5 and 6e5 are not large numbers if you use a linear time algorithm instead of the quadratic one:
#!/usr/bin/env python
with open('letters') as file:
    letters = file.read().splitlines()

def combine_ids(letters):
    with open('numbers') as file:
        for line in file:
            id, space, numbers_str = line.lstrip().partition(' ')
            try:
                numbers = list(map(int, numbers_str.split(',')))
            except ValueError:
                continue # skip invalid lines
            for n in numbers:
                try:
                    yield letters[n], id
                except IndexError:
                    pass

result = dict(combine_ids(letters))
print(result)
If multiple ids correspond to the same letter combination (that is, if the same number appears more than once in the numbers file), then the id read last wins.
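If last-wins is not what you want, one alternative is to keep every id per letter combination. This is only a minimal sketch, assuming the same row layout as above (an id, a space, then comma-separated numbers) and zero-based indexing. The function name `collect_all_ids` and the in-memory sample data are made up for illustration, and unlike the script above it ignores negative numbers instead of letting them wrap around:

```python
from collections import defaultdict

def collect_all_ids(letters, number_lines):
    """Map each letter combination to a list of every id that mentions it."""
    result = defaultdict(list)
    for line in number_lines:
        id_, _, numbers_str = line.strip().partition(' ')
        try:
            numbers = [int(s) for s in numbers_str.split(',')]
        except ValueError:
            continue  # skip invalid lines, as in the script above
        for n in numbers:
            if 0 <= n < len(letters):  # ignore numbers with no matching line
                result[letters[n]].append(id_)
    return dict(result)

# Hypothetical sample data: the number 2 appears under both id1 and id3.
sample_letters = ['AAAA', 'AAAB', 'AAAC', 'AAAD']
sample_numbers = ['id1 2, 33', 'id3 3, 2']
all_ids = collect_all_ids(sample_letters, sample_numbers)
print(all_ids)
```

Here AAAC ends up with both id1 and id3, in the order the rows were read.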
numbers:
id1 5, 33
id2 23
id3 103, 2, 3
letters:
AAAA
AAAB
AAAC
AAAD
AAAE
AAAF
AAAG
AAAH
AAAI
AAAJ
AAAK
...
AAAX
AAAY
AAAZ
{'AAAX': 'id2', 'AAAC': 'id3', 'AAAD': 'id3', 'AAAF': 'id1'}
Note: the number 2 corresponds to AAAC here (zero-based indexing); use letters[n-1] if letters should be indexed from 1 (assuming n>=1).
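The n>=1 assumption matters because Python accepts negative indices: with n = 0, letters[n-1] would quietly return the last line instead of raising IndexError. A small sketch of a guarded one-based lookup follows; the helper name `letter_for` is hypothetical:

```python
def letter_for(letters, n):
    """Return the letters entry for a one-based line number n, or None."""
    if 1 <= n <= len(letters):
        return letters[n - 1]
    return None  # n has no matching line (this includes 0 and negatives)

sample = ['AAAA', 'AAAB', 'AAAC']
print(sample[0 - 1])          # unguarded: n = 0 wraps around to the last entry
print(letter_for(sample, 0))  # guarded: None
print(letter_for(sample, 2))  # second line
```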
Upvotes: 2
Reputation: 33273
You can do it with a single pass over each file, O(N):
Read the letters file into an array. The array index (+ 1?) is then the line number.
Read the numbers file. For each row, use the numbers to look up the letters in the array and pair them with the row's id.
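The two steps above can be sketched roughly as follows. This assumes one-based line numbers (the "+ 1" case) and the row layout shown in the question, and it uses in-memory lists in place of the real files; `pair_ids` is a made-up name:

```python
def pair_ids(letter_lines, number_lines):
    """Pair each letter combination with the id whose row lists its line number."""
    # Step 1: read the letters into a list; index + 1 == line number.
    letters = [line.strip() for line in letter_lines]

    # Step 2: one pass over the numbers rows.
    combined = {}
    for row in number_lines:
        parts = row.split(None, 1)  # id, then the rest of the row
        if len(parts) != 2:
            continue  # no numbers on this row
        row_id, rest = parts
        for token in rest.split(','):
            token = token.strip()
            if not token.isdigit():
                continue  # skip anything that is not a plain number
            n = int(token)
            if 1 <= n <= len(letters):
                combined[letters[n - 1]] = row_id
    return combined

# Sample rows from the question; only 2 falls inside the three letter lines.
pairs = pair_ids(['AAAA', 'AAAB', 'AAAC'],
                 ['id1 5, 201, 66, 33', 'id2 356', 'id3 5103, 2, 452'])
print(pairs)
```

With the real files, open file objects could be passed in directly, since both arguments are only iterated line by line. Each letter lookup is a constant-time list access, so the total work is proportional to the size of the two files.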
Upvotes: 1
Reputation: 3170
Store all letters in a list: letters = ["AAAA", "AAAB", "AAAC", ...].
Now, while reading the numbers file, create a mapping from row index to id, like
0 mapped to id1
1 mapped to id2
....
m[0] = "id1", m[1] = "id2"...
In the same pass, fill an array of zeroes so that each number's slot holds the id of the row it came from:
p = [0] * len(letters)            # created once, before reading the rows
# then, for each row (row_index counts rows from 0):
row_name = row[:row.find(" ")]    # the id, e.g. "id1"
m[row_index] = row_name
nums = row[row.find(" ") + 1:].split(",")
for num in nums:
    p[int(num)] = m[row_index]    # int() also tolerates the space after each comma
Now, to find the id paired with the i-th entry of the letters list, just do
print(p[i])
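Put together, the list-based approach might look like the sketch below. It is only an illustration under a few assumptions: numbers are zero-based, None (rather than 0) marks lines with no id, and the id is stored in p directly instead of going through the intermediate mapping. `build_lookup` and the sample rows are invented for the example:

```python
def build_lookup(letters, number_rows):
    """Return a list p where p[i] is the id paired with letters[i], or None."""
    p = [None] * len(letters)
    for row in number_rows:
        row_name, _, rest = row.strip().partition(' ')
        for num in rest.split(','):
            num = num.strip()
            # Skip non-numeric tokens and numbers beyond the last letters line.
            if num.isdigit() and int(num) < len(letters):
                p[int(num)] = row_name
    return p

letters = ['AAAA', 'AAAB', 'AAAC', 'AAAD']
p = build_lookup(letters, ['id1 0, 3', 'id2 2, 17'])
for combination, row_name in zip(letters, p):
    print(combination, row_name)
```

Because p runs parallel to letters, looking up the id for line i stays a single list access.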
Upvotes: 1