pandagrammer

Reputation: 871

Replacing string with id using dictionary in python

I have a dictionary file that contains a word in each line.

titles-sorted.txt

 a&a    
 a&b    
 a&c_bus    
 a&e    
 a&f    
 a&m    
 ....

For each word, its line number is the word's id.

Then I have another file that contains a set of words separated by tabs on each line.

a.txt

 a_15   a_15_highway_(sri_lanka)    a_15_motorway   a_15_motorway_(germany) a_15_road_(sri_lanka)

I'd like to replace each word with its id if the word exists in the dictionary, so that the output looks like this:

    3454    2345    123   5436     322 .... 

So I wrote the following Python code to do this:

 f = open("titles-sorted.txt")
 lines = f.readlines()
 titlemap = {}
 nr = 1
 for l in lines:
     l = l.replace("\n", "")
     titlemap[l.lower()] = nr
     nr += 1

 fw = open("a.index", "w")
 f = open("a.txt")
 lines = f.readlines()
 for l in lines:
     tokens = l.split("\t")
     if tokens[0] in titlemap.keys():
         fw.write(str(titlemap[tokens[0]]) + "\t")
         for t in tokens[1:]:
             if t in titlemap.keys():
                 fw.write(str(titlemap[t]) + "\t")
         fw.write("\n")

 fw.close()
 f.close()

But this code is ridiculously slow, which makes me suspect that I haven't done everything right.

Is this an efficient way to do this?

Upvotes: 3

Views: 629

Answers (3)

cdlane

Reputation: 41925

If we apply the suggestions so far and clean up your code some more (e.g. remove unnecessary .keys() calls), is the following still too slow for your needs?

title_map = {}

token_file = open("titles-sorted.txt")

for number, line in enumerate(token_file):
    title_map[line.rstrip().lower()] = str(number + 1)

token_file.close()

input_file = open("a.txt")
output_file = open("a.index", "w")

for line in input_file:
    tokens = line.split("\t")

    if tokens[0] in title_map:
        output_list = [title_map[tokens[0]]]
        output_list.extend(title_map[token] for token in tokens[1:] if token in title_map)
        output_file.write("\t".join(output_list) + "\n")

output_file.close()
input_file.close()

If it's still too slow, give us slightly more data to work with including an estimate of the number of lines in each of your two input files.

Upvotes: 0

Alex Alifimoff

Reputation: 1849

So, I suspect this differs based on the operating system you're running on and the specific Python implementation (someone wiser than I may be able to provide some clarity here), but I have a suspicion about what is going on:

Every time you call write, some amount of your desired write request gets written to a buffer, and once the buffer is full, that information is written to the file. The file needs to be fetched from your hard disk (as it doesn't exist in main memory), so your computer pauses for the several milliseconds it takes to fetch the block from the hard disk and write to it. On the other hand, the string parsing and the hashmap lookup take only a few nanoseconds, so you spend most of your time waiting for the write requests to finish!

Instead of writing immediately, what if you kept a list of the lines you want to write and wrote them all at the end, in a row? Or, if you're handling a huge file that would exceed the capacity of your main memory, write the batch out once you have parsed a certain number of lines.

This allows the writing to disk to be optimized, as you can write multiple blocks at a time (again, this depends on how Python and the operating system handle the write call).

Upvotes: 1

njzk2

Reputation: 39403

The write loop contains a lot of calls to write, which are usually inefficient. You can probably speed things up by writing only once per line (or once per file if the file is small enough).

tokens = l.split("\t")
fw.write('\t'.join(str(titlemap[t]) for t in tokens if t in titlemap))
fw.write("\n")

or even:

lines = []
for l in f:
    lines.append('\t'.join(str(titlemap[t]) for t in l.split('\t') if t in titlemap))
fw.write('\n'.join(lines))

Also, if your ids are used more than once, you can save time by converting them to strings when you first read the dictionary:

titlemap = {l.strip().lower(): str(index) for index, l in enumerate(f, start=1)}

Upvotes: 4
