Reputation: 21
I'm exporting feeds from AlienVault OTX using staxii and trying to send them to MISP. But when sending some feeds, the following error occurs:
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in position 3397: Body ('’') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.
for filename in os.listdir(dest_directory):
    filenameWithDir = dest_directory+filename
    try:
        file_index += 1
        print("****************")
        print(dest_directory + filename)
        print(file_index)
        print("****************")
        misp_config.upload_stix(filenameWithDir, '1')
    except UnicodeEncodeError:
        with open(filenameWithDir, 'r') as file:
            filedata = file.read()
        filedata = filedata.replace('вЂ', ' ').replace('’', ' ').replace('“', ' ').replace('”', ' ')\
            .replace('–', ' ').replace('—', ' ').replace('™', ' ').replace('​', ' ').replace(' ', ' ')\
            .replace(' ', ' ').replace('…', ' ').replace('гЂЂ', ' ').replace('лЇёл¶Ѓ м •мѓЃнљЊл‹ґ м „л§ќ л°Џ 대비', ' ')\
            .replace(',', ' ').replace('•', ' ').replace('‑', ' ')
        with open(filenameWithDir, 'w') as file:
            file.write(filedata)
        file_index += 1
        print("****************")
        print(dest_directory + filename)
        print(file_index)
        print("****************")
        misp_config.upload_stix(filenameWithDir, '1')
I tried to replace characters that are not readable, but there are too many of them. Is it possible to delete characters by the position indicated in the error?
Upvotes: 2
Views: 13774
Reputation: 8932
This is basically a Unicode problem that would happen in any Unicode-aware language. Fundamentals:
There are encodings that can store any Unicode code point (e.g. UTF-8, UTF-16) as well as encodings that permit only a subset of Unicode code points, e.g. ISO 8859-1 (aka Latin-1), which supports only a small superset of ASCII.
Python translates between Unicode data (str) and byte data (bytes) using .encode (for str → bytes) and .decode (for bytes → str). Your code (or something that is called by your code) apparently uses .encode('latin-1'), but this encoding fails for the Right Single Quotation Mark \u2019 because Latin-1 does not support this character.
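To make the two points above concrete, here is a minimal sketch using only the standard library. The sample string is made up for illustration and is not taken from your feeds. It shows the round trip through UTF-8, the failure with Latin-1, and the position information carried by the exception. It also shows the built-in error handlers, which can stand in for a hand-written chain of replace() calls if you really do want to drop unsupported characters.

```python
# Hypothetical sample text containing U+2019 RIGHT SINGLE QUOTATION MARK.
text = "It\u2019s a test"

# str -> bytes: UTF-8 can represent every Unicode code point.
utf8_bytes = text.encode("utf-8")
# bytes -> str: decoding with the same encoding restores the original.
round_tripped = utf8_bytes.decode("utf-8")

# Latin-1 only covers U+0000..U+00FF, so U+2019 cannot be encoded.
try:
    text.encode("latin-1")
except UnicodeEncodeError as err:
    # The exception records the slice of the input that failed.
    bad_start, bad_end = err.start, err.end
    bad_char = err.object[bad_start:bad_end]

# Error handlers process every unsupported character in one pass.
dropped = text.encode("latin-1", errors="ignore")    # unsupported chars removed
replaced = text.encode("latin-1", errors="replace")  # unsupported chars become "?"
```

Note that "ignore" and "replace" both lose information, so they are a workaround rather than a fix; sending the data in an encoding that supports the characters is usually preferable.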
You can use another encoding to send that character. UTF-8 is a good choice for that, but your counterpart MUST be configured to use this encoding as well; otherwise you will get Mojibake, where the other side interprets your UTF-8 data as Latin-1 and your character ’ could show up as ’.
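The Mojibake effect can be reproduced in a few lines. This is only an illustrative sketch with standard codecs; it does not touch MISP or your upload code. One detail worth knowing: the visible three-character sequence comes from decoding as Windows-1252, while strict Latin-1 maps two of the three bytes to invisible control characters.

```python
original = "\u2019"  # RIGHT SINGLE QUOTATION MARK

# The sender encodes as UTF-8: three bytes, 0xE2 0x80 0x99.
utf8_bytes = original.encode("utf-8")

# A receiver that wrongly assumes Windows-1252 sees three characters.
garbled_cp1252 = utf8_bytes.decode("cp1252")

# A receiver that assumes strict Latin-1 gets "â" plus two C1 control characters.
garbled_latin1 = utf8_bytes.decode("latin-1")

# If nothing was lost, reversing the wrong decode recovers the text.
repaired = garbled_cp1252.encode("cp1252").decode("utf-8")
```

This is also why the replace() chain in the question contains strings such as 'вЂ': they look like UTF-8 bytes that were read with a different single-byte code page.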
If you are using Windows, it is likely that your source data was using Windows-1252 instead of Latin-1. This encoding is quite similar and does have an encoding for your Right Single Quotation Mark, so Windows-1252 may be a better choice of encoding.
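A quick way to check which encodings can carry a given piece of text is to try them. The helper name can_encode below is hypothetical, introduced only for this sketch. It shows that Windows-1252 handles the quotation mark but, being a single-byte encoding, still fails on the Korean text that appears in the question's replace() chain.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

quote = "\u2019"   # RIGHT SINGLE QUOTATION MARK
korean = "대비"     # sample Hangul text, as seen in the question's code

# Windows-1252 assigns the quotation mark to the single byte 0x92.
cp1252_bytes = quote.encode("cp1252")

results = {
    ("quote", "latin-1"): can_encode(quote, "latin-1"),
    ("quote", "cp1252"): can_encode(quote, "cp1252"),
    ("korean", "cp1252"): can_encode(korean, "cp1252"),
    ("korean", "utf-8"): can_encode(korean, "utf-8"),
}
```

So Windows-1252 would only help for feeds limited to Western European punctuation; for mixed-language feeds, UTF-8 remains the encoding that covers everything.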
Upvotes: 3