Reputation: 926
At the end of my data pipeline, when I finally go to push a Python dict to a JSON file (to be pulled on demand by an API), I dump the dict to the file like so:
json.dump(data, out_file)
99.9% of the time this works perfectly and the data is accessible to the end user in the desired format, e.g.:
out_file.json
{
"good": {
"JSON": "that", "I": "wanted", "to": ["push", ":)"]
},
"more_good": {
"JSON": "that", "I": "wanted", "to": ["push", ":)"]
}
}
However, my struggle is with the other 0.1% of the pushes ... I've noticed the data will sometimes be written without completely removing the previous data from the file, and I'll end up with situations like the following:
out_file.json
{
"good": {
"JSON": "that", "I": "wanted", "to": ["push", ":)"]
},
"more_good": {
"JSON": "that", "I": "wanted", "to": ["push", ":)"]
}
}ed", "to": ["push", ":)"]}}
As of now, I've come up with the following temporary 'solution':
Before pushing the dict I will push an empty string to clear the file:
json.dump('', out_file)
json.dump(data, out_file)
Then, when getting the file contents for the end user I'll check to ensure content availability like so:
q = json.load(in_file)
while q == '':  # also acts as an if
    q = json.load(in_file)
return q
My primary concern is that pushing the empty string prior to the data will only make the edge cases less likely (if even that), and that I will continue to see these same errors occur in the future, with the added potential for end-user data access being disrupted by the blank strings constantly being sent to the data file.
Since the problem occurs only 0.1% of the time and I'm not sure what exactly causes the edge cases, it has been time-consuming to test for, so I can't be sure yet how my attempted temporary solutions have panned out. The inability to test for the edge cases seems like a bug in and of itself, caused by a lack of understanding of what brings the bug about in the first place.
Upvotes: 0
Views: 201
Reputation: 926
@Amardan hit the nail on the head when he diagnosed the problem as being caused by multiple threads writing to the same file simultaneously. To solve the problem in my specific use case I had to diverge slightly from his recommended solution, and I incidentally ended up incorporating elements of the solution recommended by @osint_alex.
Unfortunately, when I tried to use the temporary file recommended by @Amardan, I would receive the following error:
[Errno 18] Invalid cross-device link: '/tmp' -> '/app/data/out_file.json'
This wasn't too big of a problem, since the solution really lay in the ability to write to my files atomically, not in the use of temp files themselves. All I had to do was create accessible files of my own to act as temporary holders for the data before writing to the final destination. I ended up using UUID4 to name these temporary files so that no two writers would target the same file at the same time (at least not any time soon ...). I was also able to use this bug as an opportunity to funnel all of my 'json.dump'-ing through one function, where I can test for edge cases and ensure each file is only written to by one writer at a time. The new function looks something like this:
import json
import os
import uuid

def update_content(content, dest):
    # Write to a uniquely named temporary file first, then atomically
    # replace the destination so readers never see a partial write.
    pth = f'/app/data/{uuid.uuid4()}.json'
    with open(pth, "w") as f:
        json.dump(content, f)
    try:
        # Read the file back to verify it holds valid JSON before promoting it.
        with open(pth) as f:
            q = json.load(f)
        # NOTE: edge case testing here ...
        os.replace(pth, dest)
    except Exception:  # add exceptions as you see fit
        os.remove(pth)
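For reference, the call site at the end of the pipeline now looks something like this (the path is just illustrative):
update_content(data, '/app/data/out_file.json')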
Upvotes: 0
Reputation: 1022
I think the above is on point. One other solution you could try, although I'm not sure if it works with your use case and it's a bit of a hack rather than addressing the root cause, would be to create a unique id for the file name.
import json
from uuid import uuid4

# Generate a unique file name for each dump so concurrent writers
# never target the same file.
filename = f"{uuid4()}.json"
with open(filename, 'w') as f:
    json.dump(data, f)
But obviously this would only work if you don't need the file to be called 'out_file.json' each time.
Upvotes: 1
Reputation: 198334
You haven't shown what out_file is or how you open it, but I expect the problem is that two threads/processes try to open and write to the file at roughly the same time. The file is truncated on open, so if the order is open1 - open2 - write1 - write2, you can get results like the ones you show. There are two basic choices:
a) use some locking mechanism to signal an error if another thread/process is doing the same thing: a mutex, an exclusive access lock... You then have to deal with one of the threads either waiting until the file is no longer in use, or giving up on the write.
b) write to a named temporary file on the same filesystem, then use atomic replace.
I recommend the second choice; it is both simpler and safer.
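For illustration, the second option could look something like this (a minimal sketch, not your exact setup; atomic_dump and the .json suffix are just placeholders):
import json
import os
import tempfile

def atomic_dump(data, dest):
    # Create the temporary file in the same directory as the destination,
    # so os.replace stays on one filesystem and remains atomic.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(dest) or '.', suffix='.json')
    try:
        with os.fdopen(fd, 'w') as f:
            json.dump(data, f)
        os.replace(tmp_path, dest)  # atomically swap in the new file
    except BaseException:
        os.remove(tmp_path)
        raise
Readers then see either the old contents or the new ones, never a half-written file.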
Upvotes: 2