Reputation: 2999
I received thousands of Excel files to process. When I open them, the data appears to be encoded in a way that I can read and process with Python.
The file names, however, are mangled. I imported the file names into sqlite and then exported the list of them to CSV to try importing into Excel with the proper encoding.
This is how they appear in the file system:
This is how the names appear if I tell Excel to import as 28596: Arabic (ISO), which I'm assuming maps to the iso8859_6 encoding in Python 3.5.
Excel itself doesn't display them correctly after the import. This is how they look, which I assume is a font issue.
Anyhow, if I import these file names into Python, I can't encode/decode them without errors. If I set errors to ignore, then the file names disappear from the output.
Any idea how to encode these to a standard Unicode Arabic that will display properly alongside all of the other Arabic text I'm working with?
Here's one example of how it appears in the file explorer on Windows and Finder on MacOS.
½ñΘ Ω⌐αε δτßí ñáƒóƒ ƒΘª¼á ƒΘßá∩í Θ¼∞⌐ 4-2016.xlsx
Edit:
Here's what I've tried in code... I have the filenames in a sqlite database, so I fetch them from there. (By the way, I don't have a problem with 99.9% of the Arabic I'm dealing with -- just these file names.)
import dataset
db = dataset.connect("sqlite:///mydata.sqlite")
# Hit on one of the characters that appears in the garbled file names
res = db.query("SELECT * FROM files WHERE file_name LIKE '%Ω%'")
file_names = [r['file_name'] for r in res]
test = file_names[0]
print(test)
>> '½ñΘ Ω⌐αε δτßí ñáƒóƒ ƒΘª¼á ƒΘßá∩í Θ¼∞⌐ 4-2016.xlsx'
Trying a few things:
test.encode('iso8859_6')
That leads to an error.
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-10-9c734319c359> in <module>()
----> 1 test.encode('iso8859_6')
C:\ProgramData\Anaconda3\lib\encodings\iso8859_6.py in encode(self, input, errors)
10
11 def encode(self,input,errors='strict'):
---> 12 return codecs.charmap_encode(input,errors,encoding_table)
13
14 def decode(self,input,errors='strict'):
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>
Trying with the codecs library:
import codecs
codecs.encode(test,encoding='iso8859_6')
Same error as above.
codecs.encode(test,encoding='iso8859_6',errors='ignore')
>> b' 4-2016.xlsx'
Another try:
codecs.encode(test,encoding='iso8859_6',errors='ignore').decode('utf-8')
>> ' 4-2016.xlsx'
Try the other way around to convert it to bytes and then to the iso format:
codecs.encode(test,encoding='utf-8',errors='ignore')
>> b'\xc2\xbd\xc3\xb1\xce\x98 \xce\xa9\xe2\x8c\x90\xce\xb1\xce\xb5 \xce\xb4\xcf\x84\xc3\x9f\xc3\xad \xc3\xb1\xc3\xa1\xc6\x92\xc3\xb3\xc6\x92 \xc6\x92\xce\x98\xc2\xaa\xc2\xbc\xc3\xa1 \xc6\x92\xce\x98\xc3\x9f\xc3\xa1\xe2\x88\xa9\xc3\xad \xce\x98\xc2\xbc\xe2\x88\x9e\xe2\x8c\x90 4-2016.xlsx'
Chaining with decode...
codecs.encode(test,encoding='utf-8',errors='ignore').decode('iso8859_6')
This error:
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-22-4a3c96284d09> in <module>()
----> 1 codecs.encode(test,encoding='utf-8',errors='ignore').decode('iso8859_6')
C:\ProgramData\Anaconda3\lib\encodings\iso8859_6.py in decode(self, input, errors)
13
14 def decode(self,input,errors='strict'):
---> 15 return codecs.charmap_decode(input,errors,decoding_table)
16
17 class IncrementalEncoder(codecs.IncrementalEncoder):
UnicodeDecodeError: 'charmap' codec can't decode byte 0xbd in position 1: character maps to <undefined>
So... maybe that's the wrong encoding?
To be honest, I don't really know where to take it from there because I'm not very familiar with the various encodings for Arabic.
Upvotes: 0
Views: 748
Reputation: 104722
This one's tricky. Your sqlite database is sending you improperly decoded data: the file-name bytes were decoded as Codepage 437 rather than Codepage 720 (Arabic). You can fix this by reversing the wrong decoding (re-encode as cp437) and then decoding properly (as cp720):
filename = '½ñΘ Ω⌐αε δτßí ñáƒóƒ ƒΘª¼á ƒΘßá∩í Θ¼∞⌐ 4-2016.xlsx'
filename_fixed = filename.encode('cp437').decode('cp720')
print(filename_fixed) # prints "سجل مرضى نقطة جباتا الخشب الطبية لشهر 4-2016.xlsx"
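If you need to apply the same repair to the whole list of names, a minimal sketch might look like the one below. The helper name fix_mojibake is made up for illustration, and the sketch assumes every garbled name went through the same cp437/cp720 mix-up. Names that can't be encoded as cp437 (for example, names that already contain real Arabic text) are returned unchanged.

```python
def fix_mojibake(name, wrong="cp437", right="cp720"):
    """Undo a wrong decode: re-encode with `wrong`, then decode with `right`.

    Hypothetical helper; assumes the name was originally cp720 bytes
    that were mistakenly decoded as cp437.
    """
    try:
        raw = name.encode(wrong)
    except UnicodeEncodeError:
        # Characters outside cp437 (e.g. real Arabic letters): not this kind
        # of mojibake, so leave the name alone.
        return name
    try:
        return raw.decode(right)
    except UnicodeDecodeError:
        # Bytes that cp720 does not define: leave the name alone.
        return name


garbled = '½ñΘ Ω⌐αε δτßí ñáƒóƒ ƒΘª¼á ƒΘßá∩í Θ¼∞⌐ 4-2016.xlsx'
fixed = fix_mojibake(garbled)
print(fixed)

# Plain ASCII names survive untouched because both code pages agree on ASCII.
print(fix_mojibake('report 4-2016.xlsx'))
```

One caveat: a name that legitimately contains non-ASCII cp437 characters (box-drawing symbols, Greek letters, and so on) may be rewritten too, so it's worth eyeballing the results before renaming anything on disk.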
Upvotes: 1