Reputation: 452
I'm trying to open a password database file (it consists of a bunch of common passwords) and I'm getting the following error:
Attempts so far. Code:
f = open("crackstation-human-only.txt", 'r')
for i in f:
    print(i)
Error Code:
Traceback (most recent call last):
File "C:\Users\David\eclipse-workspace\Kaplin\password_cracker.py", line 3, in <module>
for i in f:
File "C:\Users\David\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 753: character maps to <undefined>
After doing some research I was told to try encoding = 'utf-8',
which I later discovered was basically guessing and hoping that the file would decode fully.
Code:
f = open("crackstation-human-only.txt", 'r', encoding = 'utf-8')
for i in f:
    print(i)
Error:
Traceback (most recent call last):
File "C:\Users\David\eclipse-workspace\Kaplin\password_cracker.py", line 3, in <module>
for i in f:
File "C:\Users\David\AppData\Local\Programs\Python\Python37\lib\codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 5884: invalid continuation byte
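A minimal sketch of what the two tracebacks are saying, using just the two offending bytes (0x81 and 0xe4) rather than the file itself:

```python
# The two bytes from the tracebacks above, checked against a few codecs.
bad = bytes([0x81, 0xe4])

print(bad.decode('latin-1'))    # latin-1 maps all 256 byte values, so it never raises

try:
    bad.decode('cp1252')        # 0x81 is undefined in cp1252 -> UnicodeDecodeError
except UnicodeDecodeError as e:
    print(e)

try:
    bad.decode('utf-8')         # 0x81 is not a valid UTF-8 lead byte
except UnicodeDecodeError as e:
    print(e)
```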
After receiving this error message, I was recommended to download a text editor like 'Sublime Text 3', open the console, and enter the command 'Encoding()', but unfortunately it wasn't able to detect the encoding.
My professor was able to use bash to grep/cat the lines in the file (I honestly know very little about bash, so if anyone else knows those terms, I'm not sure whether that will help them out).
If anyone has any suggestions on what I can do in order to get this to work out I would greatly appreciate it.
I will post the link to the text document if anyone is interested in seeing what types of characters are within the file.
Link to the file, it's a .txt from my school/professors domain
UPDATE:
I have a fellow classmate running elementary OS who was using the terminal to write his Python program, which iterates through the file with the encoding 'latin-1'. He was able to output more characters than me. I'm on Windows 10, using Eclipse/Atom for all my scripts.
So something in these factors seems to be keeping me from getting the correct output; I'm guessing, because it just seems that way based on the results.
I will be installing elementary OS and attempting all the solutions there, to see if I can get this file to work out. I'll add another update soon!
Upvotes: 1
Views: 3265
Reputation: 892
I faced a similar problem a while ago, and more often than not setting
encoding = 'raw_unicode_escape'
has worked for me.
For your particular case, I tried all the encoding types Python 3 supports and found that this also works:
f = open("crackstation-human-only.txt", 'r', encoding = 'mbcs')
Try either of the above to read your file (note that 'mbcs' is a Windows-only codec).
For more information on encodings, refer to https://docs.python.org/2.4/lib/standard-encodings.html
Hope this helps.
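A quick sketch (on sample bytes, not the actual file) of why 'raw_unicode_escape' tends to "work": like latin-1, it maps every byte value, so it can never raise on arbitrary data:

```python
sample = b'\x81\xe4'   # the two bytes from the question's tracebacks

# raw_unicode_escape passes bytes through as latin-1 (plus \uXXXX escape handling),
# so decoding arbitrary binary data never raises
print(sample.decode('raw_unicode_escape'))
```

That the decode succeeds does not mean the resulting text is what the file's author intended, only that no byte was rejected.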
Edit: with the link above I made a list of encoding formats to try on your file. I hadn't saved my previous work, which was more detailed, but this code should do the same. I re-ran it now as follows:
enc_list = ['big5', 'cp037', 'cp437', 'cp737', 'cp850', 'cp855',
            'cp857', 'cp861', 'cp863', 'cp865', 'cp869', 'cp875',
            'cp949', 'cp1006', 'cp1140', 'cp1251', 'cp1253', 'cp1255',
            'cp1257', 'euc_jp', 'euc_jisx0213', 'gb2312', 'gb18030',
            'iso2022_jp', 'iso2022_jp_2', 'iso2022_jp_3', 'iso2022_kr',
            'iso8859_2', 'iso8859_4', 'iso8859_6', 'iso8859_8',
            'iso8859_10', 'iso8859_14', 'johab', 'koi8_u', 'mac_greek',
            'mac_latin2', 'mac_turkish', 'shift_jis', 'shift_jisx0213',
            'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'base64_codec',
            'bz2_codec', 'hex_codec', 'idna', 'mbcs', 'palmos',
            'punycode', 'quopri_codec', 'raw_unicode_escape', 'rot_13',
            'string_escape', 'undefined', 'unicode_escape',
            'unicode_internal', 'uu_codec', 'zlib_codec']

# removing from enc_list while iterating over it skips entries,
# so collect the encodings that succeed into a separate list instead
working = []
for encode in enc_list:
    try:
        with open(r"crackstation-human-only.txt", encoding=encode) as f:
            f.read()
        working.append(encode)
    except Exception:
        pass
print(working)
Run this code on your machine and you'll get a list of encodings you can try on your file. The output I received was
['cp037', 'cp737', 'cp855', 'cp861', 'cp865', 'cp875', 'cp1006', 'cp1251', 'cp1255', 'euc_jp', 'gb2312', 'iso2022_jp', 'iso2022_jp_3', 'iso8859_2', 'iso8859_6', 'iso8859_10', 'johab', 'mac_greek', 'mac_turkish', 'shift_jisx0213', 'utf_16_le', 'utf_8', 'bz2_codec', 'idna', 'mbcs', 'palmos', 'quopri_codec', 'raw_unicode_escape', 'string_escape', 'unicode_escape', 'uu_codec']
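Worth noting: a codec not raising an error only means every byte was mappable, not that the decoded text is correct. A minimal sketch, using made-up sample bytes rather than the actual file:

```python
data = b'caf\xe9'                  # 'café' encoded as latin-1 / cp1252

print(data.decode('latin-1'))      # café  (the intended text)
print(data.decode('cp437'))        # also succeeds, but 0xe9 becomes 'Θ' (wrong text)
```

So a brute-force list like the one above narrows the candidates, but you still have to eyeball the output to pick the encoding that produces sensible passwords.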
Upvotes: 3
Reputation: 189
You do have some interesting characters in there. Even though your code does work for me, I'd suggest using a try/except block to catch the lines your system can't handle and skip them. Note that the decode error is raised while reading the line, so the read itself has to be inside the try:
with open("crackstation-human-only.txt", 'r') as f:
    while True:
        try:
            line = f.readline()
            if not line:
                break
            print(line)
        except UnicodeDecodeError:
            continue
Alternatively, try using open with 'rb' instead of 'r'. You could also pass the errors='replace' argument, but that will not do what you want; see the open documentation.
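For completeness, a sketch of what errors='replace' does (on made-up sample bytes, not the actual file): every undecodable byte is substituted with U+FFFD, so reading never fails, but the original characters are lost, which is why it may not be what you want for a password list:

```python
sample = b'pass\x81word\xe4'       # made-up bytes containing the two problem values

# errors='replace' swaps U+FFFD in for anything the codec can't decode
print(sample.decode('utf-8', errors='replace'))   # pass\ufffdword\ufffd
```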
Upvotes: 0