Using .replace effectively on text

Question

I'm attempting to capitalize all words in a section of text that only appear once. I have the bit that finds which words only appear once down, but when I go to replace the original word with the .upper version, a bunch of other stuff gets capitalized too. It's a small program, so here's the code.

from collections import Counter
from string import punctuation

path = input("Path to file: ")
with open(path) as f:
    word_counts = Counter(word.strip(punctuation) for line in f for word in line.replace(")", " ").replace("(", " ")
                      .replace(":", " ").replace("", " ").split())

wordlist = open(path).read().replace("\n", " ").replace(")", " ").replace("(", " ").replace("", " ")

unique = [word for word, count in word_counts.items() if count == 1]

for word in unique:
    print(word)
    wordlist = wordlist.replace(word, str(word.upper()))

print(wordlist)

The output should be 'Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan.', as 'sojournings' is the first word that only appears once. Instead, it outputs 'GenesIs 37:1 Jacob lIved In the land of hIs FATher's SOJOURNINGS, In the land of Canaan.' The replacement also capitalizes letters inside other words (the 'i' in 'lived', the 'fat' in 'father's'), not just the words that appear once.

Any ideas?

Patrick Artner · Accepted Answer

Text replacement by patterns calls for regex.

Your text is a bit tricky; you have to:

remove digits
remove punctuation
split into words
handle capitalisation: 'It's' vs 'it's'
only replace full matches: 'remote' must stay untouched when replacing 'mote'
etc.
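
To illustrate the last two points, here is a minimal sketch on a made-up sentence (not the bible text). It uses the \b word-boundary anchor, which is a simpler alternative to the lookarounds in the code further down and is assumed here only for illustration; the variable names are hypothetical.

```python
import re
from collections import Counter

sample = "The remote mote is in the remote land"  # made-up sample text

# str.replace matches substrings, so 'mote' inside 'remote' is changed too
naive = sample.replace("mote", "MOTE")

# \b matches only at word boundaries, so only the standalone word is changed
whole = re.sub(r"\bmote\b", "MOTE", sample)

# counting case-insensitively treats 'The' and 'the' as the same word
counts = Counter(w.lower() for w in sample.split())

print(naive)          # The reMOTE MOTE is in the reMOTE land
print(whole)          # The remote MOTE is in the remote land
print(counts["the"])  # 2
```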

This should do it - see the comments inside for explanations:

bible.txt is from your link

import re
from collections import Counter, defaultdict
from string import punctuation, digits

with open(r"SO\AllThingsPython\P4\bible.txt") as f:
    s = f.read()

# get a set of unwanted characters and clean the text
ps = set(punctuation + digits)
s2 = ''.join(c for c in s if c not in ps)

# split into words
s3 = s2.split()

# map each upper-cased word to all capitalizations of it found in the text
repl = defaultdict(set)
for word in s3:
    repl[word.upper()].add(word)  # e.g. {..., 'IN': {'In', 'in'}, 'THE': {'The', 'the'}, ...}

# count all words _upper case_ and keep those that occur only once
single_occurrence_upper_words = [w for w, n in Counter(w.upper() for w in s3).most_common() if n == 1]
text = s

# escape the punctuation so characters like ']' and '\' stay literal inside [...]
delims = re.escape(punctuation) + " "

# now the replace part - for all upper-cased words that occur only once
for upp in single_occurrence_upper_words:

    # for all capitalizations of that word occurring in the text
    for orig in repl[upp]:

        # use a regex to find the original word from our repl dict with
        # space/punctuation before and after it, and replace it with the uppercase word
        text = re.sub(f"(?<=[{delims}])({re.escape(orig)})(?=[{delims}])", upp, text)

print(text)

Output (shortened):

Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan.

2 These are the GENERATIONS of Jacob.

Joseph, being seventeen years old, was pasturing the flock with his brothers. He was a boy with the sons of Bilhah and Zilpah, his father's wives. And Joseph brought a BAD report of them to their father. 3 Now Israel loved Joseph more than any other of his sons, because he was the son of his old age. And he made him a robe of many colors. [a] 4 But when his brothers saw that their father loved him more than all his brothers, they hated him
and could not speak PEACEFULLY to him.

The regex uses lookahead '(?=...)' and lookbehind '(?<=...)' syntax to make sure only full words are replaced; see the regex syntax section of the Python re documentation.
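
A small sketch of how the lookarounds behave on a single line may help. The helper name upper_whole_word and the variable delims are hypothetical and only wrap the same pattern for demonstration; the sample line is the first verse quoted in the question.

```python
import re
from string import punctuation

# characters allowed directly before/after a word; re.escape keeps
# characters such as ']' and '\' literal inside the character class
delims = re.escape(punctuation) + " "

def upper_whole_word(text, word):
    """Uppercase `word` only where a delimiter sits on both sides (hypothetical helper)."""
    pattern = f"(?<=[{delims}]){re.escape(word)}(?=[{delims}])"
    return re.sub(pattern, word.upper(), text)

line = "Jacob lived in the land of his father's sojournings, in the land of Canaan."

# the 'in' inside 'sojournings' is not touched, only the standalone words
print(upper_whole_word(line, "in"))
# Jacob lived IN the land of his father's sojournings, IN the land of Canaan.

print(upper_whole_word(line, "sojournings"))
# Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan.
```

Because both lookarounds require an actual character, a word at the very start or end of the string (with no delimiter on one side) is left unchanged by this pattern.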
