Replace character that is not recognised by encoding

Question

I have a large file that I'm trying to import. The file is made up of millions of rows of customer-created data. As such, some users have used characters that are not recognised by the encoding (fewer than 1 character per 100,000).

However, this is causing the code to break, as it doesn't recognise the character, and giving me the following error:

UnicodeEncodeError: 'charmap' codec can't encode character '\x96' in position 619: character maps to <undefined>

In the specific case above, the encoding doesn't recognise the long hyphen.

The code I am currently using to read the file and apply some transformations is:

def conversion(path, source, count):
    file = open(path, "w")
    iFile = open(source, 'r', encoding="utf-8")
    len_text = 1
    file.write("[\n")

    for line in iFile:                          # For all the lines in the file
        line = line.strip()                     # Remove newline/whitespace from begin and end of line
        line = line.replace('"newDetails":{','')
        line = line.replace('},"addrDate"',',"addrDate"')
        line = line.replace('},"open24Id"',',"open24Id"')

        if len_text != count:                   # While len_text does not equal line_count
            line+= r","                         # Add , to end of the line
            line+= "\n"                         # Add \n to end of line
            file.write(line)                    # Write line to file
        else:
            line += "\n"                        # Add \n to end of line
            file.write(line)                    # Write line to file

        len_text += 1                           # Increment len_text by 1

    file.write("]")                             # Write ] to end of file
    file.close()                                # Close file
    return

The break occurs on file.write(line).

How can I tell the script to search for, and replace the character \x96 with another character?

Tommy Lawrence · Accepted Answer

Based on my comment: a try block catches the error at the point where it is raised, and the except block is where you decide how to deal with it. So if you wrote

try:
    file.write(line)        # the call that raises in your loop
except UnicodeEncodeError:
    continue                # skip this line and carry on with the next one

it would skip the offending line, whereas doing something like

try:
    file.write(line)        # the call that raises in your loop
except UnicodeEncodeError:
    file.write("Your character")

That lets your code keep running: when it hits the error, it writes the replacement you chose instead. Bear in mind that, as written, the replacement stands in for the whole line that failed, not just the one character. Play with the code to change it to how you want it to work; this is just a generic example.
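
If you want to swap out only the offending character rather than the whole line, something along these lines may work. This is a rough sketch, not tested against your data: `SAFE_CHARS`, `clean` and `open_output` are made-up names, the plain hyphen is just one possible replacement, and `cp1252` is an assumption about which 'charmap' codec your system is using for the output file.

```python
# Sketch only: SAFE_CHARS, clean() and open_output() are hypothetical names.

# Translation table mapping the offending character to a plain hyphen.
SAFE_CHARS = str.maketrans({"\x96": "-"})


def clean(line: str) -> str:
    """Return the line with the known problem characters swapped out."""
    return line.translate(SAFE_CHARS)


def open_output(path: str, encoding: str = "cp1252"):
    """Open the output file so that unencodable characters are written
    as '?' instead of raising UnicodeEncodeError.

    The cp1252 default is an assumption about the platform encoding.
    """
    return open(path, "w", encoding=encoding, errors="replace")
```

In the loop you would then call `file.write(clean(line))`, and/or create `file` with `open_output(path)`. Since the input is read as UTF-8, opening the output with `encoding="utf-8"` as well is another option that should avoid the error without replacing anything.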
