Decode unknown string

Question

I have a single data source that I don't control. It sends strings in different encodings, and I have no way of knowing the encoding in advance. I need to detect the encoding so that I can decode each string correctly and store it in a format I understand and control, say UTF-8.

For example:

"J'ai dÃ©jÃ\xa0 un problÃ¨me, aprÃ¨s... je ne sais pas"

should read:

"J'ai déjà un problème, après... je ne sais pas"

What I have tried:

> stringToTest="J'ai dÃ©jÃ\xa0 un problÃ¨me, aprÃ¨s... je ne sais pas"
# str has no decode() method, but one can try an encode/decode round trip
> stringToTest.encode().decode()
"J'ai dÃ©jÃ\xa0 un problÃ¨me, aprÃ¨s... je ne sais pas"
# which does not help :)

By trial and error, I found that the encoding is 'iso-8859-1':

> stringToTest.encode('iso-8859-1').decode()
"J'ai déjà un problème, après... je ne sais pas"

What I need is a way to find 'iso-8859-1' automatically!
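One way the trial and error above could be automated, as a minimal sketch only: re-encode the string with each candidate single-byte codec and keep the first result that decodes cleanly as UTF-8. The helper name `repair_mojibake` and the `CANDIDATES` list are hypothetical, and the sketch assumes the damage is always "UTF-8 bytes wrongly decoded with a single-byte codec", which may not hold for every string from the source.

```python
# Hypothetical helper; the candidate list is an assumption, not an exhaustive set.
CANDIDATES = ("iso-8859-1", "cp1252")

def repair_mojibake(text, candidates=CANDIDATES):
    """Return (repaired_text, encoding) for the first candidate that
    round-trips to valid UTF-8, or (text, None) if none does."""
    for enc in candidates:
        try:
            # Undo the wrong decoding, then decode the bytes as UTF-8.
            repaired = text.encode(enc).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this candidate cannot explain the string
        if repaired != text:
            return repaired, enc
    return text, None  # nothing to repair, or no candidate matched

broken = "J'ai dÃ©jÃ\xa0 un problÃ¨me, aprÃ¨s... je ne sais pas"
print(repair_mojibake(broken))
```

For the example string this recovers the intended sentence and reports 'iso-8859-1'. Pure ASCII text, or text that is already correct, comes back unchanged with None as the encoding.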

I tried to use chardet:

> import chardet

> chardet.detect(stringToTest)
Traceback (most recent call last):
  File "/snap/pycharm-community/188/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "", line 1, in 
  File "/usr/lib/python3/dist-packages/chardet/__init__.py", line 34, in detect
    '{0}'.format(type(byte_str)))
TypeError: Expected object of type bytes or bytearray, got:

But since it is a string, chardet does not accept it! And, I am ashamed to admit, I cannot manage to convert the string into something that chardet accepts:

> test1=b"J'ai dÃ©jÃ un problÃ¨me, aprÃ¨s... je ne sais pas"
  File "", line 1
SyntaxError: bytes can only contain ASCII literal characters.

# OK, str and unicode are similar things... but who knows?!
> test1=u"J'ai dÃ©jÃ un problÃ¨me, aprÃ¨s... je ne sais pas"
> chardet.detect(test1)
Traceback (most recent call last):
  File "/snap/pycharm-community/188/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "", line 1, in 
  File "/usr/lib/python3/dist-packages/chardet/__init__.py", line 34, in detect
    '{0}'.format(type(byte_str)))
TypeError: Expected object of type bytes or bytearray, got: 

# Nope
> bytes(stringToTest)
Traceback (most recent call last):
  File "/snap/pycharm-community/188/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "", line 1, in 
TypeError: string argument without an encoding
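For context on the error above, a small sketch using only the standard library: `bytes()` needs an explicit encoding when given a str, and the choice of encoding decides which bytes come out. It assumes the 'iso-8859-1' value found by trial and error, so it illustrates the conversion rather than the detection; the variable names are illustrative.

```python
string_to_test = "J'ai dÃ©jÃ\xa0 un problÃ¨me, aprÃ¨s... je ne sais pas"

# bytes() on a str requires an encoding; str.encode() is the equivalent method.
raw_a = bytes(string_to_test, "iso-8859-1")
raw_b = string_to_test.encode("iso-8859-1")
assert raw_a == raw_b

# These bytes are valid UTF-8, so decoding them yields the intended text.
print(raw_a.decode("utf-8"))

# By contrast, encode() defaults to UTF-8, so this round trip changes nothing.
assert string_to_test.encode().decode() == string_to_test
```

The last line shows why the first attempt returned the same garbled string: encoding and decoding with the same codec is a no-op.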

Why not unidecode?

from unidecode import unidecode
unidecode(stringToTest)
'J\'ai dA(c)jA un problA"me, aprA"s... je ne sais pas'
