Reputation: 463
I've looked at many different posts about replacing characters on this site and others, and I've done string replacing before. In this specific instance, however, I'm running into an unexpected issue. I'm hoping I'm just missing something obvious...
I'm trying to replace a list of special characters with their HTML entity codes. I've tried a few versions of this, from plaintext replacements (½ to &frac12;) to the last iteration, using byte-encoded strings (as suggested here).
The functionality of my code is pretty simple. I get the contents of a file:
with open(cur_file, 'r', encoding='utf-8') as file_handle:
    # The with block closes the file on exit, so no explicit close() is needed.
    file_contents = file_handle.read()
Then I call my 'replacer()' function:
good_text = replacer(file_contents)
Contents of replacer() function:
def replacer(text):
    replace_chars = {
        b'\xc2\xbd': '&frac12;',      # ½
        b'\xe2\x80\x9c': '&quot;',    # “
        b'\xe2\x80\x9d': '&quot;',    # ”
        b'\xe2\x80\x99': '&acute;',   # ’
        b'\xe2\x80\x93': '&mdash;',   # –
        b'\xc2\xa9': '&copy;'         # ©
    }
    for k, v in replace_chars.items():
        good_text = text.replace(k.decode('utf-8'), v)
        print('replacing ' + k.decode('utf-8') + ' with ' + v)
    return good_text
Then I save the new text back into the file:
with open(cur_file, 'w', encoding='utf-8') as file_handle:
    # The with block closes the file on exit, so no explicit close() is needed.
    file_handle.write(good_text)
print('Done!')
In the console, I run this and get:
replacing ½ with &frac12;
replacing “ with &quot;
replacing ” with &quot;
replacing ’ with &acute;
replacing – with &mdash;
replacing © with &copy;
Done!
This is as expected. However the file I'm replacing the strings in has the following contents:
replace_chars = {
    '½': '&frac12;',
    '“': '&quot;',
    '”': '&quot;',
    '’': '&acute;',
    '–': '&mdash;',
    '©': '&copy;'
I would expect the file not to contain ½ or the other characters in the first column, but instead to look like '&copy;': '&copy;'.
Upvotes: 1
Views: 250
Reputation: 782148
Each time through your loop you're replacing from the original text, not the result of the previous replacement. So the final result is just the last replacement, not all of them.
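To see the effect in isolation, here is a minimal sketch. The sample string and the two-entry mapping are made up for illustration and are not taken from your file, and the entity names are assumed. Since `str.replace` returns a new string and leaves `text` untouched, every pass starts from the original again:

```python
# Hypothetical sample data, chosen only to illustrate the bug.
text = '½ cup © 2020'
replace_chars = {'½': '&frac12;', '©': '&copy;'}

# Buggy pattern: each pass reads the unmodified original `text`,
# so good_text is overwritten rather than built up.
for k, v in replace_chars.items():
    good_text = text.replace(k, v)

# Only the last replacement (©) survives; ½ is still present.
print(good_text)
```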
Change the loop so you store the result back in the same variable.
    for k, v in replace_chars.items():
        text = text.replace(k.decode('utf-8'), v)
        print('replacing ' + k.decode('utf-8') + ' with ' + v)
    return text
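Put together, the whole function might look like the sketch below. It assumes plain `str` keys instead of decoded byte strings (in Python 3 the text read from the file is already `str`) and assumes named HTML entities as the replacement values. `replacer_translate` is a hypothetical alternative of mine, not something from your code: it uses `str.maketrans` and `str.translate`, which handle all single-character substitutions in one pass.

```python
def replacer(text):
    # Assumed mapping: str keys, named HTML entities as values.
    replace_chars = {
        '½': '&frac12;',
        '“': '&quot;',
        '”': '&quot;',
        '’': '&acute;',
        '–': '&mdash;',
        '©': '&copy;',
    }
    for k, v in replace_chars.items():
        # Reassign to the same variable so each pass builds on the previous one.
        text = text.replace(k, v)
    return text


def replacer_translate(text):
    # Hypothetical alternative: a translation table maps each single
    # character to its replacement string, applied in one pass.
    table = str.maketrans({'½': '&frac12;', '©': '&copy;'})
    return text.translate(table)


print(replacer('“½” © – ’'))
```

The loop version also works for multi-character keys, while `str.translate` only accepts single-character keys, so which one fits depends on what you need to replace.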
Upvotes: 2