Reputation: 355
I wrote a function that takes a list of items with several fields and writes each item to one or several files depending on the content of some of the fields.
The file names are based on the content of those fields, so for example an item with value AAA in the field rating and Spain in the field country will end up in the files AAA_firms.txt, Spain_firms.txt and Spain_AAA_firms.txt (just an example, not the real case).
When I first coded it I used 'w+' as the mode to open the files. Most of the content of the files seemed to be corrupt: the files were filled with ^@ characters, with only a few correct entries at the end. For example, a file that should hold more than 3500 entries had fewer than 100 legible entries at the end; the rest of the file was ^@ characters.
I could not find the cause, so I did it a different way: I stored all the entries in lists in the dict and then wrote each list to a file in one pass, again opening the file with w+. This worked fine, but I was left curious about what had happened.
Among other things, I tried changing 'w+' to 'a+', and that works!
I would like to know the exact difference that makes 'w+' work erratically and 'a+' work fine.
I left the code below with the mode set to 'w+' (this way it writes what seems to be garbage to the file).
The code is not 100% real: I had to modify names, and it is part of a class (the source list itself, actually a dict wrapper, as you can guess from the code here).
def extractLists(self, outputDir, filenameprefix):
    totalEntries = 0
    aKey = "rating"
    bKey = "country"
    nameKey = "name"
    representativeChars = 2
    fileBase = outputDir + "/" + filenameprefix
    filenameAll = fileBase + "_ALL.txt"
    mailLists = dict()  # filename -> open file object
    for item in self.content.values():
        if (item[aKey] != aKey):
            totalEntries = totalEntries + 1
            filenameA = fileBase + "_" + item[aKey] + "_ANY.txt"
            filenameB = fileBase + "_ANY_" + item[bKey][0:representativeChars] + ".txt"
            filenameAB = fileBase + "_" + item[aKey] + "_" + item[bKey][0:representativeChars] + ".txt"
            mailLists.setdefault(filenameAll, open(filenameAll, "w+")).write(item[nameKey] + "\n")
            mailLists.setdefault(filenameA, open(filenameA, "w+")).write(item[nameKey] + "\n")
            mailLists.setdefault(filenameB, open(filenameB, "w+")).write(item[nameKey] + "\n")
            mailLists.setdefault(filenameAB, open(filenameAB, "w+")).write(item[nameKey] + "\n")
    for fileHandle in mailLists.values():
        fileHandle.close()
    print(totalEntries)
    return totalEntries
Upvotes: 1
Views: 66
Reputation: 1122242
You are reopening the file objects each time in the loop, even if already present in the dictionary. The expression:
mailLists.setdefault(filenameA,open(filenameA,"w+"))
opens the file first, because both arguments to setdefault() are evaluated before the call is made. Using open(..., 'w+') truncates the file.
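The eager evaluation is easy to confirm without touching any files. The following is a minimal sketch in which make_default is a hypothetical stand-in for open(name, 'w+'); it only records how often it is called:

```python
calls = []

def make_default(name):
    # Hypothetical stand-in for open(name, "w+"): records every invocation.
    calls.append(name)
    return "handle-for-" + name

cache = {}
for _ in range(3):
    # The default argument is built on every iteration,
    # even though the key exists after the first pass.
    cache.setdefault("a.txt", make_default("a.txt"))

print(len(calls))  # 3: the stand-in ran on every iteration
print(len(cache))  # 1: only the first result was stored
```

With a real open(..., 'w+') in place of the stand-in, each of those extra calls would truncate the file again.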
This is fine the first time, when the filename is not yet present in the dictionary. On every subsequent iteration, however, you truncate a file for which there is still an open file handle. That already-existing file handle in the dictionary has a file writing position, and continues to write from that position. Since the file has just been truncated to zero length, the gap between the start of the file and that position is filled with null bytes (the ^@ characters you saw), which leads to the behaviour you observed: corrupted file contents. You'll still see multiple entries written, as data could still be buffered; only data already flushed to disk is lost.
See this short demo (executed on OS X; different operating systems and filesystems can behave differently):
>>> with open('/tmp/testfile.txt', 'w') as f:
...     f.write('The quick brown fox')
...     f.flush()  # flush the buffer to disk
...     open('/tmp/testfile.txt', 'w')  # second open call, truncates
...     f.write(' jumps over the lazy fox')
...
<open file '/tmp/testfile.txt', mode 'w' at 0x10079b150>
>>> with open('/tmp/testfile.txt', 'r') as f:
...     f.read()
...
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 jumps over the lazy fox'
Opening the files in append mode ('a' or 'a+') doesn't truncate, which is why that change made things work.
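For comparison, here is a small sketch of the same reopen-while-open pattern using append mode. The temporary path and the names first and second are only illustrative assumptions, and as with the demo above, details can vary between operating systems and filesystems:

```python
import os
import tempfile

# Illustrative scratch file; the name is arbitrary.
path = os.path.join(tempfile.mkdtemp(), "append_demo.txt")

first = open(path, "a")
first.write("one\n")
first.flush()              # push the data to the file

second = open(path, "a")   # reopening in append mode does not truncate
second.write("two\n")
second.flush()

first.write("three\n")     # the older handle keeps writing; nothing is lost
first.close()
second.close()

with open(path) as f:
    content = f.read()

print(repr(content))
```

All three lines survive and the file contains no null bytes, because nothing was truncated. Reopening on every iteration still leaks file handles, though, so append mode hides the problem rather than fixing it.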
Don't keep reopening files; open a file only when it is not yet in the dictionary. You'll have to use an if statement for that:
if filenameA not in mailLists:
    mailLists[filenameA] = open(filenameA, 'w+')
I'm not sure why you are using + in the file mode, however, since you don't appear to be reading from any of the files.
For filenameAll, the value never changes, so you don't need to open that file in the loop at all. Move that outside of the loop and open the file just once.
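Putting these points together, one possible shape for the function is sketched below. It is not the original class method: it is written as a standalone function that takes the items directly, and the names extract_lists, handle_for and handles are hypothetical. It uses plain 'w' mode and opens each file at most once, on first use.

```python
import os

def extract_lists(items, output_dir, filename_prefix,
                  a_key="rating", b_key="country", name_key="name",
                  representative_chars=2):
    """Write each item's name to the ALL file and to its per-field files."""
    file_base = os.path.join(output_dir, filename_prefix)
    handles = {}  # filename -> open file object

    def handle_for(filename):
        # Open each file only once; later lookups reuse the same handle.
        if filename not in handles:
            handles[filename] = open(filename, "w")
        return handles[filename]

    total_entries = 0
    try:
        for item in items:
            if item[a_key] == a_key:  # same filter as in the question
                continue
            total_entries += 1
            a = item[a_key]
            b = item[b_key][:representative_chars]
            line = item[name_key] + "\n"
            for filename in (
                file_base + "_ALL.txt",
                file_base + "_" + a + "_ANY.txt",
                file_base + "_ANY_" + b + ".txt",
                file_base + "_" + a + "_" + b + ".txt",
            ):
                handle_for(filename).write(line)
    finally:
        # Close every handle even if a write fails part-way through.
        for fh in handles.values():
            fh.close()
    return total_entries
```

Here the ALL file goes through the same helper instead of being opened before the loop; either approach opens it exactly once.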
Upvotes: 2