Reputation: 2672
I am trying to build a list of all words appearing in the files in a specified set of directories, and then save this list to a file. When I print any element of the list it looks fine (it is human-readable), but after I write the list to a file I see only byte values. Here is my code:
import os

directoryList = ['/Users/Kuba/Desktop/Articles/1', '/Users/Kuba/Desktop/Articles/2', '/Users/Kuba/Desktop/Articles/4']
bigBagOfWords = []

for directory in directoryList:
    for filename in os.listdir(directory):
        filename = os.path.join(directory, filename)
        currentFile = open(filename, 'rt', encoding='latin-1')
        for line in currentFile:
            currentLine = line.split(' ')
            for word in currentLine:
                if word.lower() not in bigBagOfWords:
                    bigBagOfWords.append(word.lower())
        currentFile.close()

saveFile = open('dictionary.txt', 'wt', encoding='latin-1')
for word in bigBagOfWords:
    saveFile.write(word)
    saveFile.write('\n')
saveFile.close()
The file "dictionary.txt" contains lines like the one below:
0000 0007 0078 0064 006b 002e 0074 0078 0074 696c 6f63 626c 6f62 0000 0010 0000 00ec 0000 09e8 ffff ffff ffff 0000 0000
How do I force Python to write those words in a human-readable encoding? Am I doing something significantly wrong here?
Upvotes: 0
Views: 120
Reputation: 1121904
You've opened a .DS_Store OS X desktop information file and added its contents to your output file. When you open the output file in Sublime Text, the editor shows binary files in a columned hex dump format.
The character sequence locblob is characteristic of this proprietary format. You also have the text xdk.txt encoded as UTF-16 hidden in the hex dump you showed us; the .DS_Store file stores icon positions and other view attributes for the files in a folder.
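To check this reading of the dump, the hex words from the question can be decoded by hand. This is only a sketch: it assumes the first four bytes are a big-endian length prefix counting UTF-16 code units, which the dump is consistent with but which is not confirmed here against any format specification. The variable names are made up for illustration.

```python
# Hex words copied from the start of the dump shown in the question.
hex_dump = "0000 0007 0078 0064 006b 002e 0074 0078 0074 696c 6f63 626c 6f62"
raw = bytes.fromhex(hex_dump.replace(" ", ""))

# Assumption: the first 4 bytes hold the filename length in UTF-16 code units.
name_length = int.from_bytes(raw[:4], "big")
name_end = 4 + 2 * name_length

# The filename is stored as big-endian UTF-16 (two bytes per character here).
filename = raw[4:name_end].decode("utf-16-be")

# The next 8 bytes are plain ASCII.
tag = raw[name_end:name_end + 8].decode("ascii")

print(name_length, filename, tag)  # 7 xdk.txt ilocblob
```

The decoded filename and the ASCII tag match the two strings mentioned above, so the output file really does contain binary .DS_Store data rather than badly encoded text.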
Filter these files out when you are looping over directories. Typically, you want to ignore files whose names start with a dot (.):
for filename in os.listdir(directory):
    if filename[0] == '.':
        continue  # skip hidden files
    filename = os.path.join(directory, filename)
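Putting the filter into the full script could look like the sketch below. It is one possible arrangement, not the only one: the function names collect_words and save_words are hypothetical, and it departs from the question's code in three labeled ways. It uses a set instead of a list for faster membership checks, it splits on any whitespace with split() instead of split(' ') so trailing newlines do not end up inside words, and it writes the words in sorted order rather than in order of first appearance.

```python
import os


def collect_words(directories, encoding="latin-1"):
    """Return a sorted list of the lowercased words found in all visible files."""
    words = set()  # a set makes the "already seen?" check cheap
    for directory in directories:
        for filename in os.listdir(directory):
            if filename.startswith("."):
                continue  # skip hidden files such as .DS_Store
            path = os.path.join(directory, filename)
            if not os.path.isfile(path):
                continue  # skip subdirectories
            # The with statement closes the file even if reading fails.
            with open(path, "rt", encoding=encoding) as current_file:
                for line in current_file:
                    words.update(word.lower() for word in line.split())
    return sorted(words)


def save_words(words, path, encoding="latin-1"):
    """Write one word per line to the given path."""
    with open(path, "wt", encoding=encoding) as save_file:
        for word in words:
            save_file.write(word + "\n")
```

With the directory list from the question, this would be called as save_words(collect_words(directoryList), 'dictionary.txt').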
Upvotes: 1