Hafiz Muhammad Shafiq

Reputation: 8678

cld2 causing invalid utf-8 character in python

I have written a small script in Python 2.7 and installed the cld2 module, which detects the language of a given string. When I run it on one file of Common Crawl data, it sometimes raises the following exception:

Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1484650471346_0002/container_1484650471346_0002_01_000005/./warc_mapper_updated.py", line 81, in <module>
    lang_status = get_lang_info(src_code)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1484650471346_0002/container_1484650471346_0002_01_000005/./mapper_updated.py", line 30, in get_lang_info
    isReliable, textBytesFound, details = cld2.detect(txt)
  File "/usr/local/lib64/python2.7/site-packages/cld2/__init__.py", line 396, in detect
    cld_results.bytes_found))
ValueError: input contains invalid UTF-8 around byte 174 (of -1603881792)

The corresponding code snippet is:

    txt = txt.encode('utf8')
    isReliable, textBytesFound, details = cld2.detect(txt)

Why is this happening? Is there a way to avoid passing invalid input to cld2? For example, if the string contains binary data (invalid UTF-8), it should be skipped.
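To make the question concrete, here is a rough sketch of the kind of guard I have in mind. It rests on an assumption I have not confirmed: that the error is triggered by control or unassigned code points in the text rather than by the `encode` step itself. The helper names `strip_unsafe_chars` and `safe_detect` are hypothetical, and the detector is passed in as an argument (it would be `cld2.detect` in my script) so the sketch does not depend on cld2 being installed.

```python
import unicodedata


def strip_unsafe_chars(text):
    """Drop code points whose Unicode category starts with 'C'
    (control, format, surrogate, private use, unassigned),
    but keep newline, carriage return and tab."""
    keep = u"\n\r\t"
    return u"".join(
        ch for ch in text
        if ch in keep or not unicodedata.category(ch).startswith("C")
    )


def safe_detect(text, detect):
    """Clean `text`, encode it as UTF-8 and call `detect` on the bytes.

    `detect` is assumed to behave like cld2.detect: it takes UTF-8 bytes
    and raises ValueError when it rejects the input.
    Returns the detector's result, or None if the input was rejected."""
    cleaned = strip_unsafe_chars(text).encode("utf-8")
    try:
        return detect(cleaned)
    except ValueError:
        # Input still rejected: skip this record instead of crashing.
        return None
```

In the mapper this would be called as `result = safe_detect(txt, cld2.detect)`, and records where `result` is `None` would be skipped. Whether stripping these characters is actually enough to satisfy cld2 is exactly what I am unsure about.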

Upvotes: 3

Views: 1291

Answers (0)
