Julia Fasick

Reputation: 131

Using .replace effectively on text

I'm attempting to capitalize all words in a section of text that only appear once. I have the bit that finds which words only appear once down, but when I go to replace the original word with the .upper version, a bunch of other stuff gets capitalized too. It's a small program, so here's the code.

from collections import Counter
from string import punctuation

path = input("Path to file: ")
with open(path) as f:
    word_counts = Counter(word.strip(punctuation) for line in f for word in line.replace(")", " ").replace("(", " ")
                      .replace(":", " ").replace("", " ").split())

wordlist = open(path).read().replace("\n", " ").replace(")", " ").replace("(", " ").replace("", " ")

unique = [word for word, count in word_counts.items() if count == 1]

for word in unique:
    print(word)
    wordlist = wordlist.replace(word, str(word.upper()))

print(wordlist)

The output should be 'Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan.', as sojournings is the first word that appears only once. Instead, it outputs 'GenesIs 37:1 Jacob lIved In the land of hIs FATher's SOJOURNINGS, In the land of Canaan.' Because some of the unique words also occur inside other words, it capitalizes those letters as well.

Any ideas?

Upvotes: 0

Views: 58

Answers (2)

Patrick Artner

Reputation: 51653

Text replacement by patterns calls for regex.

Your text is a bit tricky; you have to

  • remove digits
  • remove punctuation
  • split into words
  • take care of capitalisation: 'It's' vs 'it's'
  • only replace full matches: 'mote' must not be replaced inside 'remote'
  • etc.
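
To illustrate the "full matches only" point from the list above, here is a minimal sketch. The sample string is made up for the demonstration and is not taken from the question's file.

```python
import re

text = "a remote mote"

# Plain str.replace matches substrings, so 'remote' is changed as well.
naive = text.replace("mote", "MOTE")

# \b anchors the match at word boundaries, so only the standalone word changes.
whole_word = re.sub(r"\bmote\b", "MOTE", text)

print(naive)       # a reMOTE MOTE
print(whole_word)  # a remote MOTE
```

This is the same effect the question describes: a unique word that also occurs inside longer words gets capitalized there too when using .replace.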

The following code should handle all of this; see the comments inside for explanations:

bible.txt is from your link

import re
from collections import Counter, defaultdict
from string import punctuation, digits

with open(r"SO\AllThingsPython\P4\bible.txt") as f:
    s = f.read()

# get a set of unwanted characters and clean the text
ps = set(punctuation + digits)  
s2 = ''.join( c for c in s if c not in ps) 

# split into words
s3 = s2.split()

# create a set of all capitalizations of each word
repl = defaultdict(set)
for word in s3:
    repl[word.upper()].add(word)  # f.e. {..., 'IN': {'In', 'in'}, 'THE': {'The', 'the'}, ...}

# count all words _upper case_ and use those that only occur once
single_occurence_upper_words = [w for w,n in Counter( (w.upper() for w in s3) ).most_common() if n == 1]
text = s

# now the replace part - for all upper single words 
for upp in single_occurence_upper_words:

    # for all capitalizations occurring in the text
    for orig in repl[upp]:

        # use regex replace to find the original word from our repl dict with
        # space/punctuation before/after it and replace it with the uppercase word
        # (re.escape keeps special characters from being read as regex syntax)
        sep = f"[{re.escape(punctuation)} ]"
        text = re.sub(f"(?<={sep})({re.escape(orig)})(?={sep})", upp, text)

print(text)

Output (shortened):

Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan.

2 These are the GENERATIONS of Jacob.

Joseph, being seventeen years old, was pasturing the flock with his brothers. He was a boy with the sons of Bilhah and Zilpah, his father's wives. And Joseph brought a BAD report of them to their father. 3 Now Israel loved Joseph more than any other of his sons, because he was the son of his old age. And he made him a robe of many colors. [a] 4 But when his brothers saw that their father loved him more than all his brothers, they hated him
and could not speak PEACEFULLY to him. 

<snip>

The regex uses lookahead '(?=...)' and lookbehind '(?<=...)' syntax to make sure only full words are replaced; see the regex syntax documentation.
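
As a small, hedged illustration of how these lookarounds behave, the sketch below wraps the pattern in a hypothetical helper, `upper_whole_word`, and runs it on a made-up sentence. Note that the lookbehind needs a character in front of the word, so a word at the very start of the text is not matched.

```python
import re
from string import punctuation

# Character class of "separator" characters: punctuation plus a space.
sep = f"[{re.escape(punctuation)} ]"

def upper_whole_word(word, text):
    # Hypothetical helper: the lookbehind/lookahead assert a separator on
    # each side of the word without consuming it.
    pattern = f"(?<={sep}){re.escape(word)}(?={sep})"
    return re.sub(pattern, word.upper(), text)

sample = "He lived in the land, not on an island."

# 'land' is replaced as a full word; the 'land' inside 'island' is left alone.
result = upper_whole_word("land", sample)
print(result)  # He lived in the LAND, not on an island.

# 'He' has no character before it, so the lookbehind fails and nothing changes.
at_start = upper_whole_word("He", sample)
print(at_start == sample)  # True
```

If words at the start or end of the text (or next to a newline) matter, negative lookarounds such as `(?<!\w)` and `(?!\w)` are one possible alternative, since they also succeed at the edges of the string.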

Upvotes: 0

Benjamin

Reputation: 546

I rewrote the code pretty significantly since some of the chained replace calls might prove to be unreliable.

import string

# The sentence.
sentence = "Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan."

# remove punctuation (Python 3 form; str.translate needs a table from str.maketrans)
rm_punc = sentence.translate(str.maketrans('', '', string.punctuation))
words = rm_punc.split(' ')  # split spaces to get a list of words

# Find all unique word occurrences.
single_occurrences = []
for word in words:
    # if word only occurs 1 time, append it to the list
    if words.count(word) == 1:
        single_occurrences.append(word)

# For each unique word, find its index and capitalize the letter at that index
# in the initial string (the letter at that index is also the first letter of
# the word). Note that strings are immutable, so we are actually creating a new
# string on each iteration. Also, sometimes small words occur inside of other
# words, e.g. 'an' inside of 'land'. In order to make sure that our call to
# `index()` doesn't find these small words, we keep track of `start` which
# makes sure we only ever search from the end of the previously found word.
start = 0
for word in single_occurrences:
    try:
        word_idx = start + sentence[start:].index(word)
    except ValueError:
        # Could not find word in sentence. Skip it.
        pass
    else:
        # Update counter.
        start = word_idx + len(word)

        # Rebuild sentence with capitalization.
        first_letter = sentence[word_idx].upper()
        sentence = sentence[:word_idx] + first_letter + sentence[word_idx+1:]

print(sentence)
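
A different approach, sketched here under assumptions and not part of the code above, is to let `re.sub` visit every whole word once and decide in a callback whether to uppercase it. The function name `upper_unique_words`, the ASCII-only word pattern, and the sample text are all made up for illustration; counting is case-insensitive.

```python
import re
from collections import Counter

def upper_unique_words(text):
    # Assumption: a word is a run of ASCII letters, optionally with inner
    # apostrophes (e.g. "father's").
    word_re = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")
    counts = Counter(w.lower() for w in word_re.findall(text))

    # The callback receives each whole word as a match, so substrings inside
    # longer words are never touched.
    def replace(match):
        word = match.group()
        return word.upper() if counts[word.lower()] == 1 else word

    return word_re.sub(replace, text)

verse = "the land of his father, the land of Canaan"
print(upper_unique_words(verse))  # the land of HIS FATHER, the land of CANAAN
```

Because the same pattern is used for counting and for replacing, the two steps cannot disagree about what counts as a word.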

Upvotes: 1
