user3575737

Reputation: 79

Iterate and match in large files in python

I have two large text files:

’letters’ with > 600 000 lines

’numbers’ with > 100 000 lines

letters looks like this:

AAAA
AAAB
AAAC
etc…

numbers has two columns, an ’id’ and a list of numbers:

    id1 5, 201, 66, 33
    id2 356
    id3 5103, 2, 452
    etc…

I want each row in ’letters’ to represent a number:

    1   AAAA
    2    AAAB
    etc…

and then check what row in ’numbers’ contains that number, then pair that id with the combination of letters, in this case only:

AAAB    id3

This script takes days to run:

combine = {}
for i, x in enumerate(letters):
    for id, number in numbers.items():
        if i+1 in number:
            combine[x['letter']] = id

Is there a faster way to do it?

Upvotes: 0

Views: 162

Answers (3)

jfs

Reputation: 414315

1e5 and 6e5 are not large numbers if you use a linear time algorithm instead of the quadratic one:

#!/usr/bin/env python
with open('letters') as file:
    letters = file.read().splitlines()

def combine_ids(letters):
    with open('numbers') as file:
        for line in file:
            id, space, numbers_str = line.lstrip().partition(' ')
            try:
                numbers = list(map(int, numbers_str.split(',')))
            except ValueError:
                continue # skip invalid lines
            for n in numbers:
                try:
                    yield letters[n], id
                except IndexError:
                    pass

result = dict(combine_ids(letters))
print(result)

If multiple ids may correspond to the same letter (that is, if the same number appears more than once in the numbers file), then the id that is read last wins.
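If every id for a letter should be kept rather than only one, the pairs can be grouped into lists instead. A minimal sketch, assuming the same row layout as above; `collect_ids` and its parameters are hypothetical names, and it takes iterables of lines so it can be tried without files. Unlike the script above, it also ignores negative numbers explicitly:

```python
from collections import defaultdict

def collect_ids(letters, number_lines):
    """Map each letter combination to every id whose row lists its index."""
    ids_by_letter = defaultdict(list)
    for line in number_lines:
        row_id, _, numbers_str = line.strip().partition(' ')
        try:
            numbers = [int(s) for s in numbers_str.split(',')]
        except ValueError:
            continue  # skip invalid lines, as in the script above
        for n in numbers:
            if 0 <= n < len(letters):  # zero-based, like letters[n] above
                ids_by_letter[letters[n]].append(row_id)
    return dict(ids_by_letter)

letters = ['AAAA', 'AAAB', 'AAAC']
grouped = collect_ids(letters, ['id1 1, 2', 'id2 2'])
print(grouped)  # {'AAAB': ['id1'], 'AAAC': ['id1', 'id2']}
```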

Example

numbers:

id1 5, 33
id2 23
id3 103, 2, 3

letters:

AAAA
AAAB
AAAC
AAAD
AAAE
AAAF
AAAG
AAAH
AAAI
AAAJ
AAAK
...
AAAX
AAAY
AAAZ

Output

{'AAAX': 'id2', 'AAAC': 'id3', 'AAAD': 'id3', 'AAAF': 'id1'}

Note: the number 2 corresponds to AAAC here (zero-based indexing). Use letters[n-1] if letters should be indexed from 1 (assuming n>=1).
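The one-based variant needs an explicit range check, because Python accepts negative indexes: with letters[n-1], a stray 0 in the numbers file would silently select the last line instead of raising IndexError. A small sketch; `letter_for` is a hypothetical helper, not part of the script above:

```python
def letter_for(letters, n):
    """Return the letter on 1-based line n, or None if n is out of range."""
    if 1 <= n <= len(letters):
        return letters[n - 1]
    return None

letters = ['AAAA', 'AAAB', 'AAAC']
print(letter_for(letters, 2))  # AAAB
print(letter_for(letters, 0))  # None (letters[0 - 1] would have returned AAAC)
```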

Upvotes: 2

Klas Lindbäck

Reputation: 33273

You can do it with a single pass over each file, O(N):

  1. Read the letters file into an array. The array index (+ 1, if line numbers start at 1) then equals the line number.

  2. Read the numbers file. For each row, use the numbers as array indexes to pair the row's id with the matching letters.
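The two steps might look like this in Python. This is a sketch under assumptions: `pair_ids` is a hypothetical name, line numbers are taken to start at 1 as in the question, and the rows of the numbers file are assumed to look like `id3 5103, 2, 452`:

```python
def pair_ids(letter_lines, number_lines):
    # Step 1: one pass over letters; list index + 1 is the line number.
    letters = [line.strip() for line in letter_lines]
    pairs = {}
    # Step 2: one pass over numbers; each number is a direct list lookup.
    for row in number_lines:
        row_id, _, rest = row.strip().partition(' ')
        for token in rest.split(','):
            token = token.strip()
            # Ignore anything that is not a valid 1-based line number.
            if token.isdecimal() and 1 <= int(token) <= len(letters):
                pairs[letters[int(token) - 1]] = row_id
    return pairs

result = pair_ids(['AAAA', 'AAAB'],
                  ['id1 5, 201, 66, 33', 'id2 356', 'id3 5103, 2, 452'])
print(result)  # {'AAAB': 'id3'}
```

With real files, the same function can be called as `pair_ids(open('letters'), open('numbers'))`, since file objects iterate line by line.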

Upvotes: 1

hyades

Reputation: 3170

Store all letters in a list letters = ["AAAA", "AAAB", "AAAC", ...].

Now after reading in the numbers file create a mapping like

0 mapped to id1
1 mapped to id2
....

m[0] = "id1", m[1] = "id2"...

While doing the above step, create an array of zeroes with one slot per letter. Then, for each row of the numbers file, assign to every number listed in that row the id of the row it belongs to:

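p = [0] * len(letters)
nums = row[row.find(" ") + 1:].split(", ")
row_name = row[:row.find(" ")]  # the id is everything before the first space
for num in nums:
    # num is a string, so convert it before indexing;
    # row_name is already the id, so it can be stored directly
    p[int(num)] = row_name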

Now, to find the id paired with the ith letter in the letters list, just do

print(p[i])

Upvotes: 1
