Reputation: 141
Stackoverflow has saved my life on countless occasions over the years. Now, it's time for me to post my first question ever, the answer to which I have been unable to find so far.
I have a tool (language/implementation is irrelevant) which accepts a text file as input. This text file (let's call it file_list.txt) contains a long list of file paths, one per line. The tool then iterates over the lines in file_list.txt and does something with every file path. This needs to be done continuously and file_list.txt needs to always contain the latest file paths because users continuously upload or delete files from the share being monitored. To achieve this, I have set up a cron job which calls a script. First the script calls the find utility with the search parameters required and pipes the output to a temporary file. When the file is fully populated, it is moved to file_list.txt. Then, once this is done, the tool is invoked with file_list.txt as an input parameter.
So far, so good. The share being monitored is VERY LARGE (~60 TB) and the find command takes around 5 hours to execute. This is not a problem since we have multiple overlapping find commands running in parallel (triggered once per hour). The entire setup runs on a compute farm, so CPU utilization, etc. is also not an issue.
The problem arises in the lag time for file detection. Ideally, I want a user to add a file and I want one of the already running, overlapping find commands to detect this file within a matter of minutes. However, I have noticed that none of the already-running find commands will detect this file. Only a find command started AFTER this file was added will detect it. This means that generally, I need to wait around 5 hours for a newly added file to be detected. This leads me to believe that the find utility somehow acts on a "cached" version of the share state when it was triggered. Is this true? Can anyone confirm this? And if so, what can I do to improve the detection lag?
Please let me know if further clarificaion is required. I am happy to provide any further details.
Upvotes: 2
Views: 183
Reputation: 37742
executing find
commands and piping the output to temporary files might work up to a certain scale, but is far from optimal. If you want a less resource intensive, more reactive solution, I would recommend considering to reimplement your software using the inotify
interface:
The inotify API provides a mechanism for monitoring filesystem events. Inotify can be used to monitor individual files, or to monitor directories. When a directory is monitored, inotify will return events for the directory itself, and for files inside the directory.
So an event will be raised for each file change; or file being added.
Note that you can then keep an internal list of files up to date which only needs to be changed when you get a event.
Upvotes: 1
Reputation: 249263
To summarize: you have a gigantic filesystem volume (60 TB) which contains a huge number of files, and you use find(1)
to name a large number of those files and put those names into a text file for analysis. You have discovered that files are not listed if they are created after find(1)
was started but before it finished.
I think the best solution is to stop thinking of this as a batch job, and do it "online" using inotify(7)
. You can use the inotify
API to be immediately informed of changes to your filesystem, including new files being created. There is of course the original C API, as well as the excellent pyinotify.
With inotify
, you can start a watcher program once and leave it running continuously (under a supervisor if needed for restarts). The operating system can then notify you whenever a relevant filesystem event occurs, and you can respond immediately rather than waiting for the next scan.
The one downside for your use case might be that the watcher program does need to run on a machine which has the filesystem mounted locally. But the overall compute resources required are probably much less than your current approach of repeated linear scans.
Upvotes: 1