david yeritsyan

Reputation: 452

Opening a text file and receiving an encoding error; tried multiple methods with no luck

I'm trying to open a password database file (it consists of a bunch of common passwords), and I'm getting the error shown below.

My attempt so far. Code:

f = open("crackstation-human-only.txt", 'r')

for i in f:
    print(i)

Error Code:

Traceback (most recent call last):
  File "C:\Users\David\eclipse-workspace\Kaplin\password_cracker.py", line 3, in <module>
    for i in f:
  File "C:\Users\David\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 753: character maps to <undefined>

After doing some research, I was told to try encoding = 'utf-8', which I later discovered was basically guessing and hoping that the file would decode cleanly.

Code:

f = open("crackstation-human-only.txt", 'r', encoding = 'utf-8')

for i in f:
    print(i)

Error:

Traceback (most recent call last):
  File "C:\Users\David\eclipse-workspace\Kaplin\password_cracker.py", line 3, in <module>
    for i in f:
  File "C:\Users\David\AppData\Local\Programs\Python\Python37\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 5884: invalid continuation byte

After receiving this error message, I was advised to download a text editor like 'Sublime Text 3', open its console, and enter the command 'Encoding()', but unfortunately it wasn't able to detect the encoding.

My professor was able to use bash to 'grep' and 'cat' the lines in the file (I honestly know very little about bash, so I'm not sure whether those terms will help anyone).

If anyone has any suggestions on what I can do to get this to work, I would greatly appreciate it.

I will post the link to the text document if anyone is interested in seeing what types of characters are within the file.

Link to the file; it's a .txt from my school/professor's domain.

UPDATE:

I have a classmate who runs elementary OS. He was writing his Python program in the terminal, iterating through the file with the encoding 'latin-1', and he was able to output more characters than I can. I'm on Windows 10, using Eclipse-atom for all my scripts.
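
Here's roughly what I think he was running (the exact loop is my reconstruction; the key part is encoding='latin-1', which maps every possible byte value to a character, so the read itself never raises a decode error even if some characters come out looking wrong):

# Roughly my classmate's approach (my reconstruction, not his exact code):
# latin-1 can decode any byte sequence, so reading never fails.
with open("crackstation-human-only.txt", 'r', encoding='latin-1') as f:
    for line in f:
        print(line, end='')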

So something about these factors (the OS and the editor/terminal) seems to be keeping me from getting the correct output; that's just my guess based on the results so far.

I will be installing elementary OS and trying all of these solutions there to see if I can get the file to read correctly. I'll add another update soon!

Upvotes: 1

Views: 3265

Answers (2)

gavin

Reputation: 892

I faced a similar problem a while ago, and more often than not I've found that setting

encoding = 'raw_unicode_escape'

has worked for me.

For your particular case, I tried all Python 3 supported encoding types and found

  • raw_unicode_escape
  • mbcs
  • palmos

Try any of the above to read your file:

f = open("crackstation-human-only.txt", 'r', encoding = 'mbcs')

For more information on encodings, refer to https://docs.python.org/2.4/lib/standard-encodings.html

Hope this helps.

Edit: With the link above, I made a list of encoding formats to try on your file. I hadn't saved my previous work, which was more detailed, but this code should do the same thing. I re-ran it now as follows:

enc_list = [
    'big5', 'cp037', 'cp437', 'cp737', 'cp850', 'cp855',
    'cp857', 'cp861', 'cp863', 'cp865', 'cp869', 'cp875',
    'cp949', 'cp1006', 'cp1140', 'cp1251', 'cp1253', 'cp1255',
    'cp1257', 'euc_jp', 'euc_jisx0213', 'gb2312', 'gb18030',
    'iso2022_jp', 'iso2022_jp_2', 'iso2022_jp_3', 'iso2022_kr',
    'iso8859_2', 'iso8859_4', 'iso8859_6', 'iso8859_8',
    'iso8859_10', 'iso8859_14', 'johab', 'koi8_u', 'mac_greek',
    'mac_latin2', 'mac_turkish', 'shift_jis', 'shift_jisx0213',
    'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8',
    'base64_codec', 'bz2_codec', 'hex_codec', 'idna', 'mbcs',
    'palmos', 'punycode', 'quopri_codec', 'raw_unicode_escape',
    'rot_13', 'string_escape', 'undefined', 'unicode_escape',
    'unicode_internal', 'uu_codec', 'zlib_codec',
]

# Collect the encodings that can decode the whole file, instead of removing
# entries from enc_list while iterating over it (which silently skips the
# element that follows each removal).
working = []
for enc in enc_list:
    try:
        with open(r"crackstation-human-only.txt", encoding=enc) as f:
            f.read()
        working.append(enc)
    except Exception:
        # unknown codec, not a text encoding, or it can't decode this file
        continue

print(working)

Run this code on your machine and you'll get a list of encodings you can try on your file. The output I received (your results may differ on another machine) was:

['cp037', 'cp737', 'cp855', 'cp861', 'cp865', 'cp875', 'cp1006', 'cp1251', 'cp1255', 'euc_jp', 'gb2312', 'iso2022_jp', 'iso2022_jp_3', 'iso8859_2', 'iso8859_6', 'iso8859_10', 'johab', 'mac_greek', 'mac_turkish', 'shift_jisx0213', 'utf_16_le', 'utf_8', 'bz2_codec', 'idna', 'mbcs', 'palmos', 'quopri_codec', 'raw_unicode_escape', 'string_escape', 'unicode_escape', 'uu_codec']
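
Once you have that list, you can spot-check a candidate by printing the first few lines and judging whether the text looks sane. A minimal sketch (the candidate name here is just an example):

# Print the first few lines decoded with one candidate encoding.
candidate = 'mbcs'   # pick any entry from the printed list; 'mbcs' is Windows-only
with open("crackstation-human-only.txt", encoding=candidate) as f:
    for line_number, line in enumerate(f):
        if line_number == 5:
            break
        print(line, end='')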

Upvotes: 3

dummman

Reputation: 189

You do have some interesting characters in there. Even though your code works for me, I'd suggest using a try/except block to catch the lines your system can't handle and skip them:

with open("crackstation-human-only.txt", 'r') as f:
    for i in f:
        try:
            print(i)
        except UnicodeDecodeError:
            continue

Alternatively, try using open with

  • the binary read mode 'rb' instead of 'r', or
  • the errors='replace' argument, but that swaps undecodable bytes for replacement characters rather than the real ones, which is probably not what you want here (a quick sketch of both options is below);

see the open documentation
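
A minimal sketch of both alternatives, assuming the file name from your question:

# Option 1: errors='replace' never fails, but undecodable bytes become
# U+FFFD replacement characters instead of the original password characters.
with open("crackstation-human-only.txt", 'r', errors='replace') as f:
    for line in f:
        print(line, end='')

# Option 2: binary mode gives you raw bytes, and you decide how to decode
# each line yourself, skipping the ones that fail.
with open("crackstation-human-only.txt", 'rb') as f:
    for raw in f:
        try:
            print(raw.decode('utf-8'), end='')
        except UnicodeDecodeError:
            continue   # this line is not valid UTF-8; skip it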

Upvotes: 0
