Reputation: 593
I have a folder into which new files are constantly being added. I have a Python script that uses os.listdir() to find these files and then performs analysis on them automatically. However, the files are quite large, so they show up in os.listdir() before they have been completely written/copied. Is there some way to distinguish which files are not still in the process of being written or moved? Comparing sizes with os.path.getsize() doesn't seem to work.
Raspbian Buster on a Pi 4 with Python 3.7.3. I am a noob to programming and Linux.
Thanks!
Upvotes: 0
Views: 738
Reputation: 43
In programming, this is called concurrency: computations happen simultaneously and the order of execution is not guaranteed. In your case, one program begins to read a file before another program has finished writing to it. This particular situation is known as the readers-writers problem and is actually fairly common in embedded systems.
There are a number of solutions to this problem, but the simplest and most common is a lock. The simplest kind of lock protects a resource from being accessed by more than one program at a time; in effect, it makes sure that operations on the resource happen atomically. A lock is implemented as an object that can be acquired or released (these are usually methods of the object). The program tries to acquire the lock in a loop that spins for as long as the acquisition fails. Once acquired (the check is usually a simple if-statement), the lock grants the holding program the ability to execute some block of code, after which the lock is released. Note that what I am referring to as a program is typically called a thread.
In Python, you can use the threading.Lock object. First, create a Lock object:
from threading import Lock
file_lock = Lock()
Then, in each thread, wait to acquire the lock before proceeding. If you set blocking=True, the call will suspend the thread until the lock is acquired, without requiring a loop:
file_lock.acquire(blocking=True)
# atomic operation on the file
file_lock.release()
Note that the same lock object must be shared across the threads. You will need to acquire the lock before reading from or writing to the file, and release it when you are done. That ensures those operations can never happen at the same time.
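Putting it together, a minimal sketch might look like this (the file name and the thread bodies are placeholders, not part of the answer above):

from threading import Lock, Thread

file_lock = Lock()
PATH = "data.txt"  # hypothetical file name

# Create the file up front so the reader never races against its creation
open(PATH, "w").close()

def writer():
    # The "with" statement acquires the lock (blocking) and
    # guarantees it is released, even if the block raises
    with file_lock:
        with open(PATH, "w") as f:
            f.write("a large payload")

def reader():
    with file_lock:
        with open(PATH) as f:
            print("read", len(f.read()), "bytes")

threads = [Thread(target=writer), Thread(target=reader)]
for t in threads:
    t.start()
for t in threads:
    t.join()

Using the lock as a context manager is equivalent to calling acquire(blocking=True) followed by release(). Keep in mind this only works when the reader and writer are threads of the same Python process; it will not protect against a separate program copying files into the folder.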
Upvotes: 1
Reputation: 11624
For a conceptual explanation of atomic and cross-filesystem moves in Python, refer to this discussion of moves in Python (it can really save you time).
You can take the following approaches to deal with your problem:
-> Monitor filesystem events with Pyinotify (see this example of Pyinotify usage). A sketch follows below.
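A hedged sketch of that approach (the folder path is a placeholder): pyinotify's IN_CLOSE_WRITE event fires only when a writer closes a file it had open for writing, which is exactly the "finished copying" signal you need.

import pyinotify

class DoneHandler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        # Fires once the writing process closes the file,
        # i.e. the write/copy has completed
        print("ready for analysis:", event.pathname)

wm = pyinotify.WatchManager()
wm.add_watch('/path/to/watched/folder', pyinotify.IN_CLOSE_WRITE)
pyinotify.Notifier(wm, DoneHandler()).loop()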
-> Lock the file for a few seconds using flock, as sketched below.
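A minimal flock sketch (Linux-only, and advisory: it only helps if the program writing the files also takes the lock; the file name is a placeholder):

import fcntl

with open("incoming.dat") as f:
    fcntl.flock(f, fcntl.LOCK_EX)  # block until an exclusive lock is granted
    data = f.read()                # safe: any cooperating writer has finished
    fcntl.flock(f, fcntl.LOCK_UN)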
-> Use lsof to check which processes are currently using a particular file:
from subprocess import check_output, Popen, PIPE, CalledProcessError

try:
    lsout = Popen(['lsof', filename], stdout=PIPE, shell=False)
    check_output(["grep", filename], stdin=lsout.stdout, shell=False)
except CalledProcessError:
    # check_output raises CalledProcessError when grep finds no process
    # using the file, i.e. nothing is still writing to it
    pass

Just write your processing code in the except branch and you are good to go.
-> Run a daemon that monitors the parent folder for any changes, using, e.g., the watchdog library (see this watchdog implementation); a sketch follows.
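A rough watchdog sketch, with a placeholder folder path. Note that on_created fires as soon as the file appears, which can still be before writing finishes, so you would combine it with one of the "is the file in use" checks from this list:

import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class NewFileHandler(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:
            print("new file:", event.src_path)

observer = Observer()
observer.schedule(NewFileHandler(), path='/path/to/watched/folder')
observer.start()
try:
    while True:
        time.sleep(1)  # the observer does its work on a background thread
except KeyboardInterrupt:
    observer.stop()
observer.join()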
-> You can check whether a file is being used by another process by looping through the PIDs in /proc and inspecting their open file descriptors (assuming you control the program that is continuously adding the new files, so you can identify its PID). See the sketch below.
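A Linux-only sketch of that idea, assuming you already know the writer's PID (the helper name is made up for illustration):

import os

def pid_has_file_open(pid, path):
    # Each entry in /proc/<pid>/fd is a symlink to an open file
    fd_dir = "/proc/%d/fd" % pid
    try:
        return any(os.path.realpath(os.path.join(fd_dir, fd)) == path
                   for fd in os.listdir(fd_dir))
    except (FileNotFoundError, PermissionError):
        return False  # process exited, or we lack permission to inspect it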
-> You can check whether a file has a handle open on it using psutil, as sketched below.
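A hedged psutil sketch (scanning every process is slow, but it needs no knowledge of the writer's PID; the helper name is made up):

import psutil

def file_in_use(path):
    # Ask every process for the files it currently has open
    for proc in psutil.process_iter(['open_files']):
        try:
            if any(f.path == path for f in (proc.info['open_files'] or [])):
                return True
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return False

If file_in_use() returns False, no process holds the file open and it should be safe to analyse.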
Upvotes: 1