Kaboodleschmitt

Reputation: 463

String.replace() with special characters only replacing some of them

I've looked at many different posts about replacing characters on this site and others, and I've done string replacing before. In this specific instance, however, I'm running into an unexpected issue. I'm hoping I'm just missing something obvious...

I'm trying to replace a list of special characters with their HTML entity codes. I've tried a few versions of this, from plain-text replacements (½ to &frac12;) to the latest iteration, which uses byte-encoded strings (as suggested here).

The functionality of my code is pretty simple. I get the contents of a file:

with open(cur_file, 'r', encoding='utf-8') as file_handle:
    file_contents = file_handle.read()
file_handle.close()

Then I call my 'replacer()' function:

good_text = replacer(file_contents)

Contents of replacer() function:

def replacer(text):
    replace_chars = {
        b'\xc2\xbd': '&frac12;',      # ½
        b'\xe2\x80\x9c': '&quot;',    # “
        b'\xe2\x80\x9d': '&quot;',    # ”
        b'\xe2\x80\x99': '&acute;',   # ’
        b'\xe2\x80\x93': '&mdash;',   # –
        b'\xc2\xa9': '&copy;'         # ©
    }
    
    for k, v in replace_chars.items():
        good_text = text.replace(k.decode('utf-8'), v)
        print('replacing ' + k.decode('utf-8') + ' with ' + v)
    return good_text

Then I save the new text back into the file:

    with open(cur_file, 'w', encoding='utf-8') as file_handle:
        file_handle.write(good_text)
    file_handle.close()
    
    print('Done!')

In the console, I run this and get:

replacing ½ with &frac12;
replacing “ with &quot;
replacing ” with &quot;
replacing ’ with &acute;
replacing – with &mdash;
replacing © with &copy;
Done!

This is as expected. However, after the run, the file I'm replacing the strings in has the following contents:

replace_chars = {
        '½': '&frac12;',
        '“': '&quot;',
        '”': '&quot;',
        '’': '&acute;',
        '–': '&mdash;',
        '&copy;': '&copy;'

I would expect the file not to contain ½ or the other characters in the first column. Instead, every line should look like the last one, '&copy;': '&copy;', where the character has been replaced by its entity.

Upvotes: 1

Views: 250

Answers (1)

Barmar

Reputation: 782148

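Each time through your loop you call replace() on the original text, not on the result of the previous replacement. str.replace() returns a new string and leaves text unchanged, so every iteration overwrites good_text with a copy that contains only that one replacement. The final result is just the last replacement (©), not all of them.

Change the loop so you store the result back in the same variable: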

    for k, v in replace_chars.items():
        text = text.replace(k.decode('utf-8'), v)
        print('replacing ' + k.decode('utf-8') + ' with ' + v)
    return text
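
To see the fix in isolation, here is a minimal self-contained sketch. The REPLACE_CHARS table, the sample string, and the replacer_translate() helper are made up for illustration, and the entity spellings are assumed from the question. It uses plain str keys instead of decoding bytes, which should behave the same way, and it also shows str.translate() as a possible single-pass alternative, since every key here is a single character.

```python
# Hypothetical table for illustration; entity spellings are assumed.
REPLACE_CHARS = {
    '\u00bd': '&frac12;',  # ½
    '\u201c': '&quot;',    # “
    '\u201d': '&quot;',    # ”
    '\u2019': '&acute;',   # ’
    '\u2013': '&mdash;',   # –
    '\u00a9': '&copy;',    # ©
}


def replacer(text):
    """Apply every replacement, feeding each result into the next one."""
    for char, entity in REPLACE_CHARS.items():
        # Reassign to text so earlier replacements are not thrown away.
        text = text.replace(char, entity)
    return text


def replacer_translate(text):
    """Alternative: one pass with str.translate (keys must be single characters)."""
    return text.translate(str.maketrans(REPLACE_CHARS))


sample = '\u201c\u00bd cup\u201d \u2013 \u00a9 2020'
print(replacer(sample))
print(replacer_translate(sample))
```

Both functions should print the same line. The translate() version avoids rescanning the text once per entry, but it only works while every key is exactly one character long.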

Upvotes: 2
