Reputation: 4062
I am recursing through folders and gathering the document names and some other data to be loaded into a database.
import os
text_file = open("Output.txt", "w")
dirName = 'D:\\'
for nextDir, subDir, fileList in os.walk(dirName):
    for fname in fileList:
        text_file.write(fname + '\n')
The problem is that some document names have foreign characters like:
RC-0964_1000 Tưởng thưởng Diamond trẻ nhất Việt Nam - Đặng Việt Thắng và Trần Thu Phương
And
RC-1046 安麗2013ARTISTRY冰上雅姿盛典-愛里歐娜.薩維琴科_羅賓.索爾科維【Suit & Tie】.mp4
And the code above gives me this error on the last line:
UnicodeEncodeError: 'charmap' codec can't encode characters in position ##-##: character maps to <undefined>
I've tried the following, without success:
temp = fname.endcode(utf-8)
temp = fname.decode(utf-8)
temp = fname.encode('ascii','ignore')
temp2 = temp.decode('ascii')
temp =unicode(fname).encode('utf8')
How can I write this script to write all characters to the file? Do I need to change the file I'm writing to or the string I'm writing, and how?
These names can be pasted into the file successfully, so why won't Python write them in?
Upvotes: 3
Views: 3133
Reputation: 414405
By default, text_file uses locale.getpreferredencoding(False) (the Windows ANSI code page in your case). On Windows, os.walk() uses the Unicode API if the input path is Unicode, so it may produce names that can't be represented in a Windows code page such as cp1252. That leads to the UnicodeEncodeError: 'charmap' codec can't encode error. An 8-bit encoding such as cp1252 can represent at most 256 characters, but Unicode has room for more than a million code points.
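To see the mismatch without touching the file system, here is a minimal sketch that tries to encode a sample name with a few codecs. The helper name can_encode and the sample string are made up for illustration; the sample only borrows a few characters from the names in the question.

```python
def can_encode(name, encoding):
    """Return True if `name` can be encoded with `encoding`."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

# Hypothetical sample mixing Vietnamese and Chinese characters.
sample = "Việt Nam - 安麗.mp4"

for enc in ("cp1252", "utf-8", "utf-16"):
    # cp1252 cannot represent these characters; the UTF encodings can.
    print(enc, can_encode(sample, enc))
```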
To fix it, use a character encoding that can represent the given names. The utf-8 and utf-16 encodings can represent all Unicode characters. You might prefer utf-16 on Windows, e.g., so that notepad.exe shows the file correctly:
with open('output.txt', 'w', encoding='utf-16') as text_file:
    print('\N{VICTORY HAND}', file=text_file)
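As a rough check of the utf-16 suggestion, the sketch below writes the same character and then inspects the raw bytes. The helper write_utf16 and the file name demo_utf16.txt are assumptions for illustration only. With encoding='utf-16', Python writes a byte order mark at the start of the file, which is what lets an editor detect the encoding.

```python
import codecs
import os
import tempfile

def write_utf16(path, text):
    """Write `text` plus a newline to `path` as UTF-16 (with a BOM)."""
    with open(path, "w", encoding="utf-16") as text_file:
        print(text, file=text_file)

# Hypothetical demo file placed in the system temp directory.
demo_path = os.path.join(tempfile.gettempdir(), "demo_utf16.txt")
write_utf16(demo_path, "\N{VICTORY HAND}")

with open(demo_path, "rb") as raw_file:
    raw = raw_file.read()

# The first two bytes should be a UTF-16 byte order mark.
has_bom = raw[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(has_bom)
```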
Upvotes: 1
Reputation: 177800
Since this is Python 3, choose an encoding that supports all of Unicode. On Windows, at least, the default is locale-dependent (for example, cp1252) and will fail for characters outside that code page, such as Chinese.
text_file = open("Output.txt", "w", encoding='utf8')
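Applied to the loop from the question, that change might look like the sketch below. The function name write_file_names and its parameters are hypothetical; the with block is an addition so the file is closed even if the walk fails partway through.

```python
import os

def write_file_names(root, out_path, encoding="utf-8"):
    """Walk `root` and write every file name to `out_path`, one per line.

    Returns the number of names written.
    """
    count = 0
    # An explicit encoding avoids the locale-dependent default (e.g. cp1252).
    with open(out_path, "w", encoding=encoding) as text_file:
        for next_dir, sub_dirs, file_list in os.walk(root):
            for fname in file_list:
                text_file.write(fname + "\n")
                count += 1
    return count
```

For the original script this would be called as write_file_names('D:\\', 'Output.txt'). Note that if the output file lives under the directory being walked, its own name will appear in the listing too.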
Upvotes: 6