Reputation: 371
I want to read all files from a folder (with os.walk
) and convert them to one encoding (UTF-8). The problem is those files don't have same encoding. They could be UTF-8, UTF-8 with BOM, UTF-16.
Is there any way to do read those files without knowing their encoding?
Upvotes: 5
Views: 4701
Reputation: 86
You can read those files in binary mode. Also, the chardet library can help you detect character encoding. Using chardet, you can detect the encoding of your files and decode the data you get. Though this module has limitations.
As an example:
from chardet import detect
with open('your_file.txt', 'rb') as ef:
detect(ef.read())
Upvotes: 6
Reputation: 5741
If it is indeed always one of these 3 then it is easy. If you can read the file using UTF-8 then it is probably UTF-8. Otherwise it will be UTF-16. Python can also automatically discard the BOM if present.
You can use a try ... except block to try both:
try:
tryToConvertMyFile(from, to, 'utf-8-sig')
except UnicodeDecodeError:
tryToConvertMyFile(from, to, 'utf-16')
If other encodings are present as well (like ISO-8859-1) then forget it, there is no 100% reliable way of figuring out the encoding. But you can guess—see for example Is there a Python library function which attempts to guess the character-encoding of some bytes?
Upvotes: 0