Reputation: 861
Below is the code that is supposed to convert a bz2 file to text format. However, I am getting a Unicode error. Since I am using UTF-8, I wonder what the error could be.
from __future__ import print_function
import logging
import os.path
import six
import sys
from gensim.corpora import WikiCorpus
if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
    # check and process input arguments
    inp = "trwiki-latest-pages-articles.xml.bz2"
    outp = "wiki_text_dump.txt"
    space = " "
    i = 0
    output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        if six.PY3:
            output.write(' '.join(text).encode().decode('unicode_escape') + '\n')
            # ###another method###
            # output.write(
            #     space.join(map(lambda x:x.decode("utf-8"), text)) + '\n')
        else:
            output.write(space.join(text) + "\n")
        #output.write(text)
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")
    output.close()
    logger.info("Finished Saved " + str(i) + " articles")
Error:
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-42-9404745af31b> in <module>()
32 for text in wiki.get_texts():
33 if six.PY3:
---> 34 output.write(' '.join(text).encode().decode('unicode_escape') + '\n')
35 # ###another method###
36 # output.write(
c:\users\m\appdata\local\programs\python\python37\lib\encodings\cp1254.py in encode(self, input, final)
17 class IncrementalEncoder(codecs.IncrementalEncoder):
18 def encode(self, input, final=False):
---> 19 return codecs.charmap_encode(input,self.errors,encoding_table)[0]
20
21 class IncrementalDecoder(codecs.IncrementalDecoder):
UnicodeEncodeError: 'charmap' codec can't encode character '\x9f' in position 47: character maps to <undefined>
I also tried replacing "unicode_escape" with "utf-8", but then I get this error:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 87-92: character maps to <undefined>
Upvotes: 0
Views: 1010
Reputation: 61526
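As the traceback indicates, the error occurs during an encoding step, not during the call to .decode: the failing frame is in cp1254.py, the codec that the output file uses when output.write converts the string to bytes (your own .encode() call defaults to UTF-8). Therefore you cannot fix the problem by changing the .decode codec.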
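To make that concrete, here is a minimal sketch that reproduces the same kind of failure without gensim or a file. The sample word is a hypothetical stand-in for a token from the dump, and the explicit .encode("cp1254") call only simulates what a text-mode file with that default encoding would do on write.

```python
# Hypothetical sample token; any word containing 'ş' should behave the same way.
sample = "şehir"

# str.encode() defaults to UTF-8, so this step is not the one that raises.
raw = sample.encode()

# 'unicode_escape' treats each non-ASCII byte as a Latin-1 code point, so the
# two UTF-8 bytes of 'ş' (0xC5, 0x9F) become two separate characters.
mangled = raw.decode("unicode_escape")

# Simulate the implicit encoding performed by a file opened with cp1254.
try:
    mangled.encode("cp1254")
    error = None
except UnicodeEncodeError as exc:
    error = exc

print(repr(mangled))
print(error)
```

Both the .encode() and the .decode() succeed here; the exception only appears once the mangled string has to be turned into cp1254 bytes, which is the step the traceback points at.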
Since the code is running in Python 3.x (six.PY3 is true, but why are you concerned with 2.x compatibility in new code written today?), and since ' '.join(text) worked, we conclude that text is either a string or a list of strings (not a bytes or list of bytes), and ' '.join(text) is a string. Indeed, the documentation tells us that WikiCorpus will already provide strings.
This string contains some character that your codec, cp1254 (a Windows code page specially intended for Turkish text), cannot encode. It is not clear to me what you hope to accomplish by encoding and then decoding again. Just use the string. Note that get_texts() is documented to yield each article as a list of tokens, so ' '.join(text) is what turns it into a single line (if text were already a single string, joining would put a space after each letter). You should verify this for yourself by debugging.
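One way to act on this is sketched below, under some assumptions: the articles list is a hypothetical stand-in for wiki.get_texts(), with each article given as a list of str tokens, and the output goes to a temporary directory rather than the real dump path. Passing encoding="utf-8" to open() is my suggestion rather than something stated above; it keeps the file from falling back to the platform default (cp1254 in the traceback), which matters if the text contains characters outside that code page.

```python
import os
import tempfile

# Hypothetical stand-in for wiki.get_texts(): one list of str tokens per article.
articles = [["istanbul", "şehir", "ağaç"], ["日本語", "metin"]]

# Temporary location so the sketch does not touch any real file.
out_path = os.path.join(tempfile.mkdtemp(), "wiki_text_dump.txt")

# An explicit encoding means write() no longer depends on the locale default.
with open(out_path, "w", encoding="utf-8") as output:
    for tokens in articles:
        # No encode/decode round trip: the joined tokens are already a str.
        output.write(" ".join(tokens) + "\n")

# Read the file back with the same encoding to confirm the round trip.
with open(out_path, encoding="utf-8") as check:
    lines = check.read().splitlines()

print(lines)
```

In the original script the equivalent change would be to the open(outp, 'w') call and the write inside the loop; whether the tokens really arrive as str is worth confirming with a quick print(type(text)) on the first article.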
Upvotes: 1