Reputation: 239810
I need to convert some files to UTF-8 because they're being output on an otherwise UTF-8 site and the content looks a little fugly at times.
I can either do this now or I can do it as they're read in (through PHP, just using fopen, nothing fancy). Any suggestions welcome.
Upvotes: 2
Views: 4987
Reputation: 1112
I don't have a clear solution for PHP, but for Python I have used the Universal Encoding Detector library (chardet), which does a pretty good job of guessing which encoding a file was written in.
Just to get you started, here's a Python script that I used for the conversion (originally I wanted to convert a Japanese code base from a mixture of UTF-16 and Shift-JIS, and I made a default guess whenever chardet was not confident about the encoding):
    import sys
    import codecs
    from chardet.universaldetector import UniversalDetector


    def DetectEncoding(fileHdl):
        """Detect the encoding of a file opened in binary mode.
        Returns the chardet result dict."""
        detector = UniversalDetector()
        for line in fileHdl:
            detector.feed(line)
            if detector.done:
                break
        detector.close()
        return detector.result


    def ReencodeFileToUtf8(fileName, encoding):
        """Re-encode the file to UTF-8 in place."""
        # TODO: This is dangerous ^^||, would need a backup option :)
        # NOTE: Use the 'replace' option, which tolerates erroneous characters
        with codecs.open(fileName, 'rb', encoding, 'replace') as src:
            data = src.read()
        with open(fileName, 'wb') as dst:
            dst.write(data.encode('utf-8', 'replace'))


    if __name__ == '__main__':
        # Check for arguments first
        if len(sys.argv) != 2:
            sys.exit("Invalid arguments supplied")
        fileName = sys.argv[1]
        try:
            # Open file and detect encoding
            with open(fileName, 'rb') as fileHdl:
                encResult = DetectEncoding(fileHdl)
            # Was it an empty file? (chardet reports no encoding at all)
            if encResult['encoding'] is None:
                sys.exit("Possible empty file")
            # Only attempt to reencode file if we are confident about the
            # encoding and if it's not UTF-8
            encoding = encResult['encoding'].lower()
            if encResult['confidence'] >= 0.7:
                if encoding != 'utf-8':
                    ReencodeFileToUtf8(fileName, encoding)
            else:
                # TODO: Probably you could make a default guess and try to encode, or
                # just simply make it fail
                sys.exit("Not confident enough about the detected encoding")
        except IOError:
            sys.exit('An IOError occurred')
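As for the backup option mentioned in the TODO, here is a minimal sketch of one way it could look, using only the standard library. The function name `reencode_with_backup` and the `.bak` suffix are my own placeholders, not part of the script above, and it assumes you already have an encoding (detected or guessed) to pass in.

```python
import os
import shutil
import tempfile


def reencode_with_backup(path, encoding, backup_suffix=".bak"):
    """Re-encode `path` to UTF-8 in place, keeping the original as a backup.

    `backup_suffix` is a hypothetical naming choice for this sketch.
    """
    backup_path = path + backup_suffix
    # Copy the untouched original first so a bad guess can be undone.
    shutil.copy2(path, backup_path)

    # Decode with the given encoding; newline="" leaves line endings unchanged.
    with open(path, "r", encoding=encoding, errors="replace", newline="") as src:
        text = src.read()

    # Write to a temporary file in the same directory, then swap it in,
    # so an interrupted run does not leave a half-written file behind.
    target_dir = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=target_dir)
    with os.fdopen(fd, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
    shutil.copymode(path, tmp_path)  # keep the original permission bits
    os.replace(tmp_path, path)
    return backup_path
```

If the result looks wrong, copying the `.bak` file back over the converted one restores the original bytes.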
Upvotes: 7
Reputation: 132978
Can a file contain data from different codepages?
If yes, then you can't do a batch conversion at all: you would have to know the codepage of every single substring in your file.
If no, it's possible to batch convert one file at a time, but only if you know which codepage each file uses. So we're more or less back in the same situation as above; we've just moved the problem from substring scope to file scope.
So the question you need to ask yourself is: do you have information about which codepage each piece of data belongs to? If not, it will still look fugly.
You can always do some analysis on your data and guess the codepage, and although this might make it a little less fugly, you are still guessing, and therefore it will still be fugly :)
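To illustrate the file-scope case, here is a rough Python sketch of a batch conversion when you do know each file's codepage. The function names and the `known_encodings` mapping are hypothetical; where that mapping comes from is exactly the information you need to have.

```python
from pathlib import Path


def convert_known(src, dst, encoding):
    """Write a UTF-8 copy of `src` at `dst`, given its known source encoding."""
    # Strict decoding (the default) raises UnicodeDecodeError when the bytes
    # are not valid in the stated encoding. It cannot catch every wrong
    # guess: single-byte codepages such as latin-1 accept any byte sequence.
    text = Path(src).read_bytes().decode(encoding)
    Path(dst).write_bytes(text.encode("utf-8"))


def convert_batch(known_encodings, out_dir):
    """`known_encodings` maps each source path to the encoding it was written in."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for src, encoding in known_encodings.items():
        # Originals are left untouched; converted copies go to `out_dir`.
        convert_known(src, out_dir / Path(src).name, encoding)
```

Calling `convert_batch({"a.txt": "shift_jis", "b.txt": "latin-1"}, "utf8")` (made-up file names) would leave the originals alone and write the UTF-8 copies into a `utf8` directory.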
Upvotes: 1
Reputation:
My first attempt at this would be:
Upvotes: 2
Reputation: 346250
Doing it only once would improve performance and reduce the potential for future errors, but if you don't know the encoding, you cannot do a correct conversion at all.
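A small Python illustration of why the encoding has to be known, using two codepages I picked only as examples: the same bytes decode without any error under both, but they give different text, and whichever one you choose gets baked into the UTF-8 output.

```python
raw = b"caf\xe9"  # four bytes of unknown origin

as_latin1 = raw.decode("latin-1")  # 0xE9 read as a Latin small e with acute
as_cp1251 = raw.decode("cp1251")   # 0xE9 read as a Cyrillic small short i

# Neither decode fails, so nothing warns you about a wrong choice.
print(as_latin1 == as_cp1251)

# The UTF-8 results differ, and only one of them matches what the
# author of the file actually typed.
print(as_latin1.encode("utf-8"))
print(as_cp1251.encode("utf-8"))
```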
Upvotes: 3