Clauric

Reputation: 1886

Replace character that is not recognised by encoding

I have a large file that I'm trying to import. The file is made up of millions of rows of customer-created data. As a result, some users have entered characters that are not recognised by the encoding (fewer than 1 character per 100,000).

However, this causes the code to break, as it doesn't recognise the character, and gives me the following error:

UnicodeEncodeError: 'charmap' codec can't encode character '\x96' in position 619: character maps to <undefined>

In the specific case above, the encoding doesn't recognise the long hyphen.

The code I am currently using to read the file and apply some transformations is:

def conversion(path, source, count):
    file = open(path, "w")
    iFile = open(source, 'r', encoding="utf-8")
    len_text = 1
    file.write("[\n")

    for line in iFile:                          # For all the lines in the file
        line = line.strip()                     # Remove newline/whitespace from begin and end of line
        line = line.replace('"newDetails":{','')
        line = line.replace('},"addrDate"',',"addrDate"')
        line = line.replace('},"open24Id"',',"open24Id"')

        if len_text != count:                   # While len_text does not equal line_count
            line+= r","                         # Add , to end of the line
            line+= "\n"                         # Add \n to end of line
            file.write(line)                    # Write line to file
        else:
            line += "\n"                        # Add \n to end of line
            file.write(line)                    # Write line to file

        len_text += 1                           # Increment len_text by 1

    file.write("]")                             # Write ] to end of file
    file.close()                                # Close file
    return 

The break occurs on file.write(line).

How can I tell the script to search for, and replace the character \x96 with another character?

Upvotes: 2

Views: 524

Answers (1)

Tommy Lawrence

Reputation: 310

Based on my comment: a try block catches the error raised by the failing statement, and the except block decides how to deal with it. So if you wrote

try:
    file.write(line)    # the statement that raises the error
except UnicodeEncodeError:
    continue            # skip this line and carry on with the next one

would skip it, but doing something like

try:
    file.write(line)
except UnicodeEncodeError:
    file.write(line.replace("\x96", "-"))    # swap in the character you want, then write

That lets the rest of your code run unchanged; when a write raises the error, the except block substitutes the replacement you chose. This is only a generic example, so adapt it to how you want your code to work.
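
To make the idea above concrete, here is a minimal, self-contained sketch. The helper name safe_write and the REPLACEMENTS table are hypothetical (they are not from the question), and the in-memory cp1252 stream only stands in for the real output file so the example runs anywhere. It retries the write after swapping known offending characters; anything not listed in the table would still raise.

```python
import io

# Hypothetical table: characters to swap out, mapped to plain ASCII stand-ins.
# "\x96" is the character from the error message in the question.
REPLACEMENTS = {"\x96": "-", "\x97": "-"}


def safe_write(out, line, replacements=REPLACEMENTS):
    """Write line to out; on an encoding error, replace known characters and retry."""
    try:
        out.write(line)
    except UnicodeEncodeError:
        for bad, good in replacements.items():
            line = line.replace(bad, good)
        out.write(line)  # still raises if an unlisted character remains


# Demo: an in-memory stream encoded as cp1252 stands in for the output file.
buf = io.BytesIO()
out = io.TextIOWrapper(buf, encoding="cp1252", newline="")
safe_write(out, "a\x96b\n")
out.flush()
print(buf.getvalue())  # b'a-b\n'
```

It may also be worth checking how the output file is opened: the question's code passes encoding="utf-8" only for the input file, so the output falls back to the platform default. Opening it with open(path, "w", encoding="utf-8") should avoid the error altogether, and open(path, "w", errors="replace") would write "?" for anything the encoding cannot represent.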

Upvotes: 2
