bob_the_bob

Reputation: 381

Regex to replace uppercase I at end of words

I am trying to figure out how to replace an uppercase I with a lowercase l at the end of words in a subtitle file. Below is a sample of the srt file. The regex needs to ignore words like FBI, which are in an ignore_list. It works fine except for words followed by a period, comma, question mark, etc.

*1
00:00:14, 391 --> 00:00:15, 976
He he'lI  crawI InterdimensionaI,

2
00:00:17, 352 --> 00:00:18, 353
Who, are you a belI? I? I am on a hilI,

3
00:00:17, 352 --> 00:00:18, 353
I walking up hi'lI, and rolI all the way down.

4
00:00:17, 352 --> 00:00:18, 353
I fe'lI and we'lI go again to the FBI.*

import re

## replace I with l
ignore_list = ["FBI"]

## load lines into list
with open(file) as f:
    lines = f.readlines()

## for each line in the list, split into words
for line in lines:
    words = line.split(" ")

    for word in words:
        if word not in ignore_list:
            if word.endswith("I"):
                print(re.sub(r'I\b', 'l', word))

Upvotes: 1

Views: 74

Answers (3)

Wiktor Stribiżew

Reputation: 627335

To only replace I at the end of words other than the whole words in your ignore_list, you may use

import re

ignore_list = ["FBI", "I"]
pattern = re.compile(rf"\b(?:{'|'.join(ignore_list)})\b|(I)\b") # \b(?:FBI|I)\b|(I)\b

# If the ignore list can contain ANY arbitrary text
# pattern = re.compile(rf"(?!\B\w)(?:{'|'.join(sorted(map(re.escape, ignore_list), key=len, reverse=True))})(?!\B\w)|(I)\b")
# (?!\B\w)(?:FBI|I)(?!\B\w)|(I)\b

with open(file) as f:  # file is the path to the .srt file
    lines = f.readlines()

result = [pattern.sub(lambda x: "l" if x.group(1) else x.group(), i) for i in lines]

See the Python demo. If your ignore_list is too long, you may build a regex TRIE.
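The answer does not show the trie itself, so here is a minimal sketch of one possible way to build such an alternation. The helper names build_trie and trie_to_regex and the extra entries in the ignore list are hypothetical, and the list is assumed to be non-empty. Shared prefixes are merged so the regex engine does not have to try each word separately.

```python
import re

def build_trie(words):
    """Store the words in a nested dict; the key "" marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie

def trie_to_regex(node):
    """Turn a trie node into a regex fragment with shared prefixes merged."""
    is_end = "" in node
    branches = [re.escape(ch) + trie_to_regex(node[ch]) for ch in sorted(node) if ch]
    if not branches:
        return ""
    if len(branches) == 1 and not is_end:
        return branches[0]
    group = "(?:" + "|".join(branches) + ")"
    # If a word ends here, whatever follows is optional
    return group + "?" if is_end else group

ignore_list = ["FBI", "FAI", "CSI", "I"]  # hypothetical entries
alternation = trie_to_regex(build_trie(ignore_list))
pattern = re.compile(rf"\b{alternation}\b|(I)\b")

line = "The FBI and FAI wilI calI. I know."
fixed = pattern.sub(lambda m: "l" if m.group(1) else m.group(), line)
print(alternation)
print(fixed)
```

The replacement callback is the same as in the code above; only the way the alternation is generated changes.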

How it works:

  • A pattern like \b(?:FBI|I)\b|(I)\b either matches FBI, I, or any other word from your list as a whole word, or it matches an I at the end of a word and captures it in Group 1.
  • pattern.sub(lambda x: "l" if x.group(1) else x.group(), i) replaces the match with l only if Group 1 matched; otherwise the match is put back unchanged, so no replacement occurs.
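As a small self-contained check of the two points above, the following sketch applies the pattern to two lines from the question instead of reading a file. The helper name fix_line is only an illustrative assumption.

```python
import re

ignore_list = ["FBI", "I"]
pattern = re.compile(rf"\b(?:{'|'.join(ignore_list)})\b|(I)\b")

def fix_line(line):
    # Group 1 is set only when a word-final I was matched outside the ignore list
    return pattern.sub(lambda m: "l" if m.group(1) else m.group(), line)

first = fix_line("Who, are you a belI? I? I am on a hilI,")
second = fix_line("I fe'lI and we'lI go again to the FBI.")
print(first)
print(second)
```

Because the regex works on the whole line, trailing punctuation such as "?" or "," no longer gets in the way.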

If the entries in your ignore list can consist of multiple words or contain special characters, use the commented-out pattern version.
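A short sketch of that second variant, assuming a made-up ignore list entry "FBI (HQ)" that contains a space and parentheses. re.escape keeps the parentheses literal, and sorting by length puts longer entries first in the alternation.

```python
import re

ignore_list = ["FBI (HQ)", "I"]  # hypothetical entries with special characters
alternation = "|".join(sorted(map(re.escape, ignore_list), key=len, reverse=True))
pattern = re.compile(rf"(?!\B\w)(?:{alternation})(?!\B\w)|(I)\b")

line = "Call the FBI (HQ) and telI them I am on a hilI."
fixed = pattern.sub(lambda m: "l" if m.group(1) else m.group(), line)
print(fixed)
```

Without re.escape, the parentheses would be read as a capturing group, which would also shift the group number that the replacement callback checks.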

Upvotes: 1

rich neadle

Reputation: 383

In this example, I add the words from the ignore_list directly to the regex pattern, so that the final uppercase I of those words is left alone while every other uppercase I that is the last letter of a word in the text is replaced with a lowercase l.

PYTHON CODE with re module

import re

ignore_list = ["FBI", "CSI", "API"]

# Create the ignore_pattern (regex) from the words on the ignore_list
ignore_pattern = r""
for item in ignore_list:
    # Remove the last letter of each item and insert what is left into a negative lookbehind.
    # Append that lookbehind to the ignore_pattern.
    ignore_pattern = ignore_pattern + r"(?<!\b{})".format(item[:-1])

# Insert the ignore_pattern into the regex pattern.     
pattern = r"{}\BI\b".format(ignore_pattern)
replacement = r"l"

# Perform the substitutions to create a new desired text string
new_text = re.sub(pattern, replacement, text)

EFFECTIVE REGEX PATTERNS FOR THIS EXAMPLE

ignore_pattern = r"(?<!\bFB)(?<!\bCS)(?<!\bAP)"

pattern = r"(?<!\bFB)(?<!\bCS)(?<!\bAP)\BI\b"

Regex demo: https://regex101.com/r/PS9FTZ/1

NOTES:

  • (?<!\bFB) Negative lookbehind (?<!...). Matches when the text preceding this point is NOT a word boundary (\b) followed by a literal F and a literal B. (NOTE: You can have multiple adjacent lookahead and lookbehind patterns at the same index (the order does not matter), and they must all succeed for the regex to continue matching. In this regex, we have three negative lookbehinds. Each lookbehind contains an ignored word minus its last letter, the uppercase I, e.g. "FBI" -> "FB", "API" -> "AP", etc.)
  • \B Not-word boundary. Matches when there is NOT a word boundary \b, i.e. in the middle of the word.
  • I Matches literal I.
  • \b Matches a word boundary.
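The pieces described in these notes can be wrapped in a small function. This is only a sketch: the name build_pattern is hypothetical, and re.escape is an addition not present in the code above, used here in case an ignored word contains regex metacharacters.

```python
import re

def build_pattern(ignore_list):
    """Build one negative lookbehind per ignored word that ends in an uppercase I."""
    lookbehinds = "".join(
        r"(?<!\b{})".format(re.escape(word[:-1]))
        for word in ignore_list
        if len(word) > 1 and word.endswith("I")
    )
    # \B keeps a standalone I from matching; \b requires the I to end the word
    return re.compile(lookbehinds + r"\BI\b")

pattern = build_pattern(["FBI", "CSI", "API"])
fixed = pattern.sub("l", "I wilI go to see API demo.")
untouched = pattern.sub("l", "What is the CSI?")
print(pattern.pattern)
print(fixed)
print(untouched)
```

Each lookbehind has a fixed width, which Python's re module requires.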

TEST STRING

text = """
*1
00:00:14, 391 --> 00:00:15, 976
He he'lI  crawI InterdimensionaI,

2
00:00:17, 352 --> 00:00:18, 353
Who, are you a belI? I? I am on a hilI,

3
00:00:17, 352 --> 00:00:18, 353
I walking up hi'lI, and rolI all the way down.

4
00:00:17, 352 --> 00:00:18, 353
I fe'lI and we'lI go again to the FBI.

5
00:00:19, 377 --> 00:00:20, 378
What is the CSI?

5
00:00:20, 378 --> 00:00:21, 379
I wilI go to see API demo.
"""

RESULT (new_text)

*1
00:00:14, 391 --> 00:00:15, 976
He he'll  crawl Interdimensional,

2
00:00:17, 352 --> 00:00:18, 353
Who, are you a bell? I? I am on a hill,

3
00:00:17, 352 --> 00:00:18, 353
I walking up hi'll, and roll all the way down.

4
00:00:17, 352 --> 00:00:18, 353
I fe'll and we'll go again to the FBI.

5
00:00:19, 377 --> 00:00:20, 378
What is the CSI?

5
00:00:20, 378 --> 00:00:21, 379
I will go to see API demo.

Upvotes: 0

Tim Biegeleisen

Reputation: 522516

You could try matching on the following regex pattern:

(?<=[^\sA-Z])I\b

This pattern says to match:

  • (?<=[^\sA-Z]) assert that the preceding character is neither whitespace (this excludes a standalone 'I') nor another capital letter (this excludes 'FBI' and other acronyms)
  • I match 'I'
  • \b word boundary follows
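A quick sketch of how the three parts behave on lines from the question. The helper name fix and the all-caps sample are illustrative assumptions.

```python
import re

pattern = re.compile(r"(?<=[^\sA-Z])I\b")

def fix(text):
    return pattern.sub("l", text)

mixed = fix("Who, are you a belI? I? I am on a hilI,")
acronym = fix("I fe'lI and we'lI go again to the FBI.")
all_caps = fix("SHE WILI GO")  # hypothetical all-caps subtitle line
print(mixed)
print(acronym)
print(all_caps)
```

No ignore list is needed with this approach. The trade-off seems to be that an I preceded by any capital letter is never replaced, so a line written entirely in capitals is left as it is.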

Updated script:

output = re.sub(r'(?<=[^\sA-Z])I\b', r'l', inp)
print(output)

This prints:

*1
00:00:14, 391 --> 00:00:15, 976
He he'll  crawl Interdimensional,

2
00:00:17, 352 --> 00:00:18, 353
Who, are you a bell? I? I am on a hill,

3
00:00:17, 352 --> 00:00:18, 353
I walking up hi'll, and roll all the way down.

4
00:00:17, 352 --> 00:00:18, 353
I fe'll and we'll go again to the FBI.*

Upvotes: 1
