Gordafarid

Reputation: 183

Text replacement doesn't work in special cases

I have a word list file, named Words.txt, which contains hundreds of words, and a few subtitle files (.srt). I would like to go through all the subtitle files and search them for every word in the word list file. If a word is found, I'd like to change its color to green. This is the code:

import fileinput
import os
import re

wordsPath = 'C:/Users/John/Desktop/Subs/Words.txt'
subsPath = 'C:/Users/John/Desktop/Subs/Season1'
wordList = []

wordFile = open(wordsPath, 'r')
for line in wordFile:
    line = line.strip()
    wordList.append(line)

for word in wordList:
    for root, dirs, files in os.walk(subsPath, topdown=False):
        for fileName in files:
            if fileName.endswith(".srt"):
                with open(fileName, 'r') as file :
                    filedata = file.read()
                    filedata = filedata.replace(' '  +word+  ' ', ' ' + '<font color="Green">' +word+'</font>' + ' ')
                with open(fileName, 'w') as file:
                    file.write(filedata)

Say the word "book" is in the list and is found in one of the subtitle files. As long as the word appears in a sentence like "This book is amazing", my code works perfectly fine. However, when the word is written as "BOOK" or "Book", or when it is at the beginning or at the end of a sentence, the code fails. How can I solve this problem?

Upvotes: 0

Views: 56

Answers (1)

Dani Mesejo

Reputation: 61910

You are using str.replace. From the documentation:

Return a copy of the string with all occurrences of substring old replaced by new

Here, an occurrence means an exact, case-sensitive match of the string old. Your code therefore only replaces the word when it is surrounded by single spaces, for example ' book ', which is different from ' BOOK ', ' Book ' and ' book'. Here are a few comparisons that show why those cases don't match:

" book " == " BOOK "  # False
" book " == " book"  # False
" book " == " Book "  # False
" book " == " bOok " # False
" book " == "   book " # False

One alternative is to use a regex like this:

import re

words = ["book", "rule"]
sentences = ["This book is amazing", "The not so good book", "OMG what a great BOOK", "One Book to rule them all",
             "Just book."]

# \b matches a word boundary, so the word also matches at the start or end of a
# sentence and next to punctuation; re.escape guards against regex metacharacters
patterns = [re.compile(r"\b({})\b".format(re.escape(word)), re.IGNORECASE | re.UNICODE) for word in words]

for sentence in sentences:
    result = sentence
    for pattern in patterns:
        # \1 re-inserts the text that was matched, preserving its original case
        result = pattern.sub(r'<font color="Green">\1</font>', result)
    print(result)
    print(result)

Output

This <font color="Green">book</font> is amazing
The not so good <font color="Green">book</font>
OMG what a great <font color="Green">BOOK</font>
One <font color="Green">Book</font> to <font color="Green">rule</font> them all
Just <font color="Green">book</font>.
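To apply the same idea to the subtitle files from the question, something along these lines might work. This is only a sketch: the helper names build_pattern, highlight and highlight_subtitles are made up for illustration, and it assumes the files are UTF-8 encoded, which may not be true for your subtitles. It also joins root and the file name, since os.walk yields bare file names.

```python
import os
import re


def build_pattern(words):
    # Hypothetical helper: one alternation over all words, longest first,
    # with each word escaped so regex metacharacters are taken literally.
    escaped = [re.escape(w) for w in sorted(set(words), key=len, reverse=True) if w]
    return re.compile(r"\b({})\b".format("|".join(escaped)), re.IGNORECASE)


def highlight(text, pattern):
    # \1 keeps the original casing of whatever was matched.
    return pattern.sub(r'<font color="Green">\1</font>', text)


def highlight_subtitles(words_path, subs_path):
    # Assumption: both the word list and the .srt files are UTF-8.
    with open(words_path, "r", encoding="utf-8") as word_file:
        words = [line.strip() for line in word_file if line.strip()]
    if not words:
        return  # nothing to highlight; an empty alternation would match everywhere

    pattern = build_pattern(words)
    for root, _dirs, files in os.walk(subs_path):
        for file_name in files:
            if file_name.endswith(".srt"):
                # os.walk yields bare names, so build the full path first.
                full_path = os.path.join(root, file_name)
                with open(full_path, "r", encoding="utf-8") as f:
                    data = f.read()
                with open(full_path, "w", encoding="utf-8") as f:
                    f.write(highlight(data, pattern))
```

You would call it as highlight_subtitles(wordsPath, subsPath) with the two paths from the question. Using a single combined pattern means each file is read and written once, and the inserted tags are never re-scanned, so words such as "font" or "green" in the list cannot corrupt tags added for an earlier word. Running it twice on the same files would still wrap the words twice, so it is worth keeping a backup of the originals.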

Upvotes: 1
