mdrnjss

Reputation: 21

'latin-1' codec can't encode character '\u2019'

I'm exporting feeds from AlienVault OTX using staxii and trying to send them to MISP. For some feeds, the upload fails with the following error:

UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in position 3397: Body ('’') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.

for filename in os.listdir(dest_directory):
    filenameWithDir = dest_directory+filename
    try:
        file_index += 1
        print("****************")
        print(dest_directory + filename)
        print(file_index)
        print("****************")
        misp_config.upload_stix(filenameWithDir, '1')
    except UnicodeEncodeError:
        with open(filenameWithDir, 'r') as file:
            filedata = file.read()
            filedata = filedata.replace('вЂ', ' ').replace('’', ' ').replace('“', ' ').replace('”', ' ')\
                .replace('–', ' ').replace('—', ' ').replace('™', ' ').replace('​', ' ').replace(' ', ' ')\
                .replace(' ', ' ').replace('…', ' ').replace(' ', ' ').replace('미북 정상회담 전망 및 대비', ' ')\
                .replace(',', ' ').replace('•', ' ').replace('‑', ' ')

        with open(filenameWithDir, 'w') as file:
            file.write(filedata)
        file_index += 1
        print("****************")
        print(dest_directory + filename)
        print(file_index)
        print("****************")
        misp_config.upload_stix(filenameWithDir, '1')

I tried to replace characters that are not readable, but there are too many of them. Is it possible to delete characters by the position indicated in the error?

Upvotes: 2

Views: 13774

Answers (1)

nd.

Reputation: 8932

This is basically a Unicode problem that would occur in any Unicode-aware language. Fundamentals:

  • Unicode is a standard that aims to define a single well-known code point (and name) for every character of every known writing system.
  • An encoding is how Unicode code points ("characters") are stored and transmitted using one or more bytes.

There are encodings that can store any Unicode code point (e.g. UTF-8, UTF-16) as well as encodings that permit only a subset of Unicode code points, e.g. ISO 8859-1 (aka Latin-1), which covers only 256 code points: ASCII plus a set of mostly Western European characters.

Python translates between Unicode text (str) and byte data (bytes) using .encode (str → bytes) and .decode (bytes → str). Your code (or something called by your code) apparently uses .encode('latin-1'), which fails for the Right Single Quotation Mark \u2019 because Latin-1 has no byte for this character.
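A minimal sketch of that round trip, using only the built-in str and bytes methods. The sample string is made up for illustration; it is not taken from your feeds.

```python
# Illustrative text containing U+2019 (Right Single Quotation Mark).
text = "it\u2019s"

# UTF-8 can encode every Unicode code point, so the round trip is lossless.
utf8_bytes = text.encode("utf-8")
assert utf8_bytes == b"it\xe2\x80\x99s"
assert utf8_bytes.decode("utf-8") == text

# Latin-1 has no byte for U+2019, so a strict encode raises UnicodeEncodeError.
try:
    text.encode("latin-1")
    latin1_failed = False
except UnicodeEncodeError as exc:
    latin1_failed = True
    # exc.start is the index of the first character that could not be encoded,
    # the same "position" number that appears in the error message.
    failing_position = exc.start
```

The position in the error message therefore only identifies the first offending character; any later ones would fail in the same way once that one is removed.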

You can use another encoding to send that character. UTF-8 is a good choice, but the receiving side MUST be configured to use this encoding as well; otherwise you will get mojibake: the other side interprets your UTF-8 data as Latin-1, and your character could show up as ’.
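The effect can be reproduced in a few lines. This sketch assumes the receiver decodes with Windows-1252, which is what many programs actually do when told "Latin-1"; a strict Latin-1 decode would yield â followed by two invisible control characters instead.

```python
quote = "\u2019"  # Right Single Quotation Mark

# What the sender puts on the wire when encoding as UTF-8: three bytes.
body = quote.encode("utf-8")
assert body == b"\xe2\x80\x99"

# A receiver that correctly decodes as UTF-8 gets the original character back.
decoded_ok = body.decode("utf-8")

# A receiver that wrongly decodes the same bytes as Windows-1252 gets
# three unrelated characters: U+00E2, U+20AC, U+2122.
garbled = body.decode("cp1252")
```

Here `garbled` is the three-character string ’, which is the typical symptom of a UTF-8/Latin-1 mismatch.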

If you are using Windows, it is likely that your source data was written in Windows-1252 rather than Latin-1. This encoding is quite similar and does have a byte for the Right Single Quotation Mark, so Windows-1252 may be a better choice of encoding.
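As a rough illustration of the difference, and of what happens to characters that neither single-byte encoding can represent, here is a small sketch. The sample text is invented; the `errors` argument is the standard error-handler parameter of `str.encode`. Treat the lossy variants as a last resort for the case where the receiver truly cannot accept UTF-8, since they silently discard data.

```python
# Invented sample: a curly apostrophe plus one Korean syllable (U+BBF8).
text = "it\u2019s \ubbf8"

# Windows-1252 maps U+2019 to the single byte 0x92.
cp1252_quote = "it\u2019s".encode("cp1252")
assert cp1252_quote == b"it\x92s"

# The Korean syllable is still not representable in Windows-1252;
# errors="replace" substitutes a question mark for it.
replaced = text.encode("cp1252", errors="replace")

# errors="ignore" drops every unencodable character, which removes them
# all at once instead of deleting them one error position at a time.
dropped = text.encode("latin-1", errors="ignore")
```

With this input, `replaced` is `b"it\x92s ?"` and `dropped` is `b"its "`.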

Upvotes: 3
