Alex

Reputation: 2219

Python: words replacing in huge text

I have a huge text and a list of ~10K words. What is the fastest way in Python to replace all of these words in the text with some other word?

EDIT: Text size is >1 GB. The text is human-written and "extremely tokenized" (every run of alphanumeric characters and every other single symbol was split into a separate token).

The number of words is >10K, each word occurs in the text exactly once, and the replacement word is the same for all replacements. Python 2.5-2.7.

Upvotes: 2

Views: 2988

Answers (4)

g.d.d.c

Reputation: 47988

Information about the input format and the search/replace pairings is going to be necessary to refine this answer if it comes close to start with, but this would be my initial stab at it (assuming some regularity in the input data; space-delimited in my example code below).

replacements = {
  's1': 'r1',
  's2': 'r2',
  ...
}

with open('input.txt') as fhi, open('output.txt', 'w') as fho:
  for line in fhi:
    words = line.split(' ')

    fho.write(' '.join(map(lambda w: replacements.get(w, w), words)))

    # Or, as a list comprehension from the comments:
    # fho.write(' '.join([replacements.get(w, w) for w in words]))

The idea here is that we'll be moving data from an input file to an output file. For each word on each line, we check whether it's in our replacements dictionary. If it is, we retrieve the new value; otherwise we get the word back unchanged, via the dict.get(key[, default]) method. This may not be ideal: it doesn't handle punctuation, it would probably have trouble with an input file that isn't broken into lines, etc., but it may be a way to get started.
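Since the question says every word maps to the same replacement, the replacements dictionary could be built straight from the word list. A minimal sketch, assuming (hypothetically) that the 10K words live one per line in a file named words.txt and that the replacement string is 'REPLACEMENT':

# Hypothetical setup: the 10K search words are assumed to be stored one per
# line in 'words.txt'; every one of them maps to the same replacement string.
REPLACEMENT = 'REPLACEMENT'

with open('words.txt') as fw:
    replacements = dict((w.strip(), REPLACEMENT) for w in fw if w.strip())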

Upvotes: 3

andref

Reputation: 750

I'd suggest a simple approach, replacing one line at a time:

pattern1 = 'foo'
pattern2 = 'bar'

with open('input.txt') as input, open('output.txt', 'w') as output:
    for line in input:
        output.write(line.replace(pattern1, pattern2))

Upvotes: 0

Zaur Nasibov

Reputation: 22659

Wow! This is not trivial at all. Here is an idea:

Step 1: Quantize the text into words, signs, etc.
        The function quantize accepts the text as an argument;
        the output is a list of words and signs.
        def quantize(text: str) -> list:
            ...
        An inverse function reconstructs the text from a given list:
        def dequantize(lst: list) -> str:
            ...

Step 2: Build a dictionary from the quantized list, so that
        d_rep[word] = word
        Then use the replacement word list to transform this dictionary, so that
        d_rep[word] = replacement

Step 3: Go through every word in the quantized list and replace it with its value
        from the d_rep dictionary. That value is either the original word or a replacement.

Step 4: Dequantize the list and restore the text. 

This should be efficient enough if you have a big text and a huge number of search/replace words. Good luck! Ask if you have any implementation questions.
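For the tokenization described in the question (runs of alphanumerics plus single other symbols), a minimal sketch of the two helpers might look like this; the regex and the decision to keep whitespace runs as tokens are assumptions, not part of the answer:

import re

# A minimal sketch of quantize/dequantize, assuming a token is a run of
# alphanumerics, a run of whitespace, or any other single character.
# Keeping the whitespace tokens means dequantize is a plain join that
# restores the original text exactly.
_TOKEN = re.compile(r'\w+|\s+|.', re.UNICODE)

def quantize(text):
    return _TOKEN.findall(text)

def dequantize(lst):
    return ''.join(lst)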

Update: With a single replacement word it's even easier: create a set from the 10K word list, and then, for every word in the quantized list, if the word is in the set, replace it in that list.

In pseudo-Python code:

qlist = quantize(text)

for i in range(0, len(qlist)):
    word = qlist[i]
    if word in wordlist_set:
        qlist[i] = 'replacement'

text = dequantize(qlist)

Upvotes: 1

MRAB

Reputation: 20664

The fastest method, if you have enough memory, may be to read the text as a string and use regex to search for and perform the replacements:

import re

def replace(matched):
    # matched.group(0) is the word that was found.
    # Return the replacement.
    return "REPLACEMENT"

# The \b ensures that only whole words are matched. (Use re.escape on each
# word if any of them can contain regex metacharacters.)
text = re.sub(r"\b(%s)\b" % "|".join(words), replace, text)

If you don't have the memory, try doing it in chunks, perhaps:

# Read a chunk and a line to ensure that you're not truncating a word.
chunk = text_file.read(1024 ** 2) + text_file.readline()
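A complete chunked loop built on that idea might look like the sketch below. The file names are hypothetical; 'words' and 'replace' are the word list and callback from the snippet above, and re.escape is added only as a precaution in case any word contains regex metacharacters:

import re

# A sketch of the chunked variant (hypothetical file names). The pattern is
# compiled once so it isn't rebuilt for every chunk.
pattern = re.compile(r"\b(%s)\b" % "|".join(map(re.escape, words)))

with open('input.txt') as text_file, open('output.txt', 'w') as out_file:
    while True:
        # Read a chunk plus one line so a word is never split across chunks.
        chunk = text_file.read(1024 ** 2) + text_file.readline()
        if not chunk:
            break
        out_file.write(pattern.sub(replace, chunk))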

Upvotes: 0
