Reputation: 8678
I have written a small script in Python 2.7 and installed the cld2 module, which detects the language of a given string. I ran it on one file of Common Crawl data, and sometimes it raises the following exception:
Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1484650471346_0002/container_1484650471346_0002_01_000005/./warc_mapper_updated.py", line 81, in <module>
lang_status = get_lang_info(src_code)
File "/mnt/yarn/usercache/hadoop/appcache/application_1484650471346_0002/container_1484650471346_0002_01_000005/./mapper_updated.py", line 30, in get_lang_info
isReliable, textBytesFound, details = cld2.detect(txt)
File "/usr/local/lib64/python2.7/site-packages/cld2/__init__.py", line 396, in detect
cld_results.bytes_found))
ValueError: input contains invalid UTF-8 around byte 174 (of -1603881792)
Following is the corresponding code snippet:
txt = txt.encode('utf8')
isReliable, textBytesFound, details = cld2.detect(txt)
Why is this happening? Is there some way to avoid passing invalid input to cld2? E.g. if a record contains binary data (invalid UTF-8), can it be skipped?
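One workaround I am considering is validating the bytes before calling cld2 and skipping records that are not valid UTF-8. This is only a sketch; `clean_utf8` is a hypothetical helper name, and it assumes the input arrives as a raw byte string from the crawl record:

```python
def clean_utf8(raw):
    """Return a clean UTF-8 byte string, or None if the input
    is not valid UTF-8 (e.g. binary data) and should be skipped."""
    try:
        # decode() validates the bytes; re-encoding yields clean UTF-8
        return raw.decode('utf-8').encode('utf-8')
    except UnicodeDecodeError:
        return None

# Usage (cld2.detect call shown for context, hypothetical here):
# cleaned = clean_utf8(txt)
# if cleaned is not None:
#     isReliable, textBytesFound, details = cld2.detect(cleaned)
# else:
#     pass  # skip this record: not valid UTF-8
```

Alternatively, `raw.decode('utf-8', 'ignore').encode('utf-8')` would drop the invalid bytes instead of skipping the whole record, at the cost of silently losing data. I don't know if either is the recommended approach for cld2.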
Upvotes: 3
Views: 1291