Mine

Reputation: 861

UnicodeEncodeError: 'charmap' codec can't encode character '\x9f' in position 47: character maps to <undefined>

Below is the code that is supposed to convert a bz2 dump to text format. However, I am getting a Unicode error. Since I am using UTF-8, I wonder what the cause could be.

from __future__ import print_function

import logging
import os.path
import six
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments

    inp =  "trwiki-latest-pages-articles.xml.bz2"
    outp = "wiki_text_dump.txt"
    space = " "
    i = 0

    output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        if six.PY3:
            output.write(' '.join(text).encode().decode('unicode_escape') + '\n')
        #   ###another method###
        #    output.write(
        #            space.join(map(lambda x:x.decode("utf-8"), text)) + '\n')
        else:
            output.write(space.join(text) + "\n")
            #output.write(text)
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")

    output.close()
    logger.info("Finished Saved " + str(i) + " articles")

Error:

UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-42-9404745af31b> in <module>()
     32     for text in wiki.get_texts():
     33         if six.PY3:
---> 34             output.write(' '.join(text).encode().decode('unicode_escape') + '\n')
     35         #   ###another method###
     36         #    output.write(

c:\users\m\appdata\local\programs\python\python37\lib\encodings\cp1254.py in encode(self, input, final)
     17 class IncrementalEncoder(codecs.IncrementalEncoder):
     18     def encode(self, input, final=False):
---> 19         return codecs.charmap_encode(input,self.errors,encoding_table)[0]
     20 
     21 class IncrementalDecoder(codecs.IncrementalDecoder):

UnicodeEncodeError: 'charmap' codec can't encode character '\x9f' in position 47: character maps to <undefined>

I've also tried replacing "unicode_escape" with "utf-8", and then I get this error:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 87-92: character maps to <undefined>

Upvotes: 0

Views: 1010

Answers (1)

Karl Knechtel

Reputation: 61526

As the traceback indicates, the error occurs while encoding (in the encode method of cp1254.py, reached from output.write), not during the call to .decode. Therefore you cannot fix the problem by changing the .decode codec.
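To see where the failure happens, here is a minimal sketch that reproduces it without gensim. The sample word is an assumption chosen for illustration (any text containing ş should behave the same way); cp1254 is the codec named in the traceback.

```python
# Hypothetical sample token; the real tokens come from WikiCorpus.
word = "kuş"

# The round trip from the question succeeds, but it mangles the text:
# the UTF-8 bytes of 'ş' (0xC5 0x9F) are reinterpreted as two Latin-1 characters.
mangled = word.encode().decode("unicode_escape")

# Writing to a file that uses the cp1254 codec has to encode the string,
# and that is the step that fails: U+009F has no mapping in cp1254.
try:
    mangled.encode("cp1254")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(exc)
```

So the '\x9f' in the error message is plausibly a by-product of the encode/decode round trip rather than a character from the original article.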

Since the code is running in Python 3.x (six.PY3 is true - but why are you concerned with 2.x compatibility in new code written today?), and since ' '.join(text) worked, we conclude that text is either a string or a list of strings (not a bytes or list of bytes), and ' '.join(text) is a string. Indeed, the documentation tells us that WikiCorpus will already provide strings.

This string contains some character that your codec, cp1254 (a Windows code page intended for Turkish text; cp1254.py is its implementation), cannot encode. It is not clear to me what you hope to accomplish by encoding and then decoding again. Just use the string. If text turns out to already be a single string, it does not need any joining either (joining a string would put a space after each character). You should verify this for yourself by debugging.
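A minimal sketch of "just use the string", assuming each item yielded by get_texts() is either a str or a list of str (worth confirming in a debugger). The encoding="utf-8" argument is an addition not discussed above: when no encoding is given, open() in Python 3.7 falls back to the locale's preferred encoding, which on the asker's machine appears to be cp1254 according to the traceback. The helper name write_articles and the sample data are hypothetical stand-ins.

```python
import os
import tempfile


def write_articles(texts, path):
    """Write one article per line to a UTF-8 encoded file."""
    count = 0
    # An explicit encoding avoids the locale default (cp1254 in the traceback).
    with open(path, "w", encoding="utf-8") as output:
        for text in texts:
            # Accept either a single string or a list of string tokens.
            line = text if isinstance(text, str) else " ".join(text)
            output.write(line + "\n")
            count += 1
    return count


# Hypothetical stand-in for wiki.get_texts().
sample = [["kuş", "ağaç"], "çiçek"]
path = os.path.join(tempfile.mkdtemp(), "wiki_text_dump.txt")
saved = write_articles(sample, path)

# Read the file back with the same encoding to check the round trip.
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```

In the original script this would amount to opening the output with open(outp, 'w', encoding='utf-8') and writing the joined tokens without any .encode() or .decode() calls.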

Upvotes: 1
