3yakuya

Reputation: 2672

Can't write human readable words to file in Python

I am trying to build a list of all words that appear in the files of a specified set of directories, and then save this list to a file. When I print any element of the list it looks fine (it is human readable), but after I write the list to a file I see only byte values. Here is my code:

import os

directoryList = ['/Users/Kuba/Desktop/Articles/1', '/Users/Kuba/Desktop/Articles/2', '/Users/Kuba/Desktop/Articles/4']
bigBagOfWords = []

for directory in directoryList:
    for filename in os.listdir(directory):
        filename = os.path.join(directory, filename)
        currentFile = open(filename, 'rt', encoding = 'latin-1')
        for line in currentFile:
            currentLine = line.split(' ')
            for word in currentLine:
                if word.lower() not in bigBagOfWords:
                    bigBagOfWords.append(word.lower())
        currentFile.close()

saveFile = open('dictionary.txt', 'wt', encoding = 'latin-1')
for word in bigBagOfWords:
    saveFile.write(word)
    saveFile.write('\n')
saveFile.close()

The file "dictionary.txt" contains lines like the following:

0000 0007 0078 0064 006b 002e 0074 0078 0074 696c 6f63 626c 6f62 0000 0010 0000 00ec 0000 09e8 ffff ffff ffff 0000 0000

How do I force Python to write those words in a human-readable encoding? Am I doing something significantly wrong here?

Upvotes: 0

Views: 120

Answers (1)

Martijn Pieters

Reputation: 1121904

You've opened a .DS_Store file, the OS X desktop information file, and added its contents to your output file. Sublime Text displays binary files as a columned hex dump, which is what you saw when you opened the output file in that editor.

The character sequence locblob is characteristic of this proprietary format. The hex dump you showed also contains the text xdk.txt, encoded as UTF-16; a .DS_Store file stores icon positions and other display attributes for the files in a folder.
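
To check that reading of the hex dump, the groups quoted in the question can be decoded by hand. This is only a small sketch; the variable names are made up for illustration, and the second run decodes with a lowercase leading letter because the question's code lowercases every word before saving it.

```python
# Hex groups copied from the dump in the question (variable names are illustrative).
utf16_part = "0078 0064 006b 002e 0074 0078 0074"
ascii_part = "696c 6f63 626c 6f62"

# Two bytes per character, high byte first: big-endian UTF-16.
name = bytes.fromhex(utf16_part.replace(" ", "")).decode("utf-16-be")

# One byte per character: plain ASCII.
tag = bytes.fromhex(ascii_part.replace(" ", "")).decode("ascii")

print(name)  # xdk.txt
print(tag)   # ilocblob
```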

Filter these files out when you are looping over directories. Typically, you want to ignore files starting with .:

for filename in os.listdir(directory):
    if filename.startswith('.'):
        continue  # skip hidden files such as .DS_Store
    filename = os.path.join(directory, filename)
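
Put together with the rest of the question's script, the filter could look roughly like the sketch below. The function names collect_words and save_words are hypothetical, not part of the original code. It also makes a few changes beyond the filter, all of them assumptions about what is wanted: subdirectories are skipped, lines are split on any whitespace instead of a single space (so trailing newlines do not end up in the words), and a dict replaces the list membership test to keep lookups fast while preserving first-seen order.

```python
import os


def collect_words(directories, encoding="latin-1"):
    """Return unique lowercased words, in first-seen order, from visible files."""
    seen = {}  # dict keys keep insertion order and give fast membership tests
    for directory in directories:
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            # Skip hidden entries (.DS_Store and similar) and anything that
            # is not a regular file, such as a subdirectory.
            if name.startswith(".") or not os.path.isfile(path):
                continue
            with open(path, "rt", encoding=encoding) as handle:
                for line in handle:
                    for word in line.split():
                        seen.setdefault(word.lower(), None)
    return list(seen)


def save_words(words, target="dictionary.txt", encoding="latin-1"):
    """Write one word per line to the target file."""
    with open(target, "wt", encoding=encoding) as handle:
        handle.writelines(word + "\n" for word in words)
```

Calling save_words(collect_words(directoryList)) would then stand in for the two loops in the question. Skipping dot-files only handles hidden files; any other binary file in those directories would still be read as latin-1 text.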

Upvotes: 1
