some user

Reputation: 347

How to find modified files in Python

I want to monitor a folder and see if any new files are added, or existing files are modified. The problem is, it's not guaranteed that my program will be running all the time (so, inotify based solutions may not be suitable here). I need to cache the status of the last scan and then with the next scan I need to compare it with the last scan before processing the files.

What are the alternatives for achieving this in Python 2.7?

Note1: Processing the files is expensive, so I'm trying to avoid processing files that have not been modified in the meantime. So, if a file is only renamed (as opposed to having its contents changed), I would also like to detect that and skip the processing.

Note2: I'm only interested in a Linux solution, but I wouldn't complain if answers for other platforms are added.

Upvotes: 0

Views: 1527

Answers (4)

johntellsall

Reputation: 15180

I suggest cheating and using the system find command. For example, the following finds all Python files that were created or modified in the last 60 minutes. The ls-style output (size and timestamp) can then help you decide whether further checking is needed.

$ echo beer > zoot.py
$ find . -name '*.py' -mmin -60 -type f -ls
1973329    4 -rw-r--r--   1 johnm    johnm           5 Aug 30 15:17 ./zoot.py
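
If shelling out is not an option, roughly the same filter can be sketched in pure Python with os.walk. This is only an illustration: the function name recently_modified and its parameters are made up here, and unlike find's -type f, os.path.isfile also accepts symlinks that point at regular files.

```python
import fnmatch
import os
import time


def recently_modified(root, pattern="*.py", minutes=60):
    """Return paths under root matching pattern, modified in the last `minutes`."""
    cutoff = time.time() - minutes * 60
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, pattern):
            path = os.path.join(dirpath, name)
            # Keep files only (symlinks to files are followed here).
            if os.path.isfile(path) and os.path.getmtime(path) > cutoff:
                matches.append(path)
    return sorted(matches)
```

Calling recently_modified(".") would then play the role of the find command above, minus the ls-style listing.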

Upvotes: 0

Tom Zych

Reputation: 13596

There are several ways to detect changes in files. Some are easier to fool than others. It doesn't sound like this is a security issue; more like good faith is assumed, and you just need to detect changes without having to outwit an adversary.

You can look at timestamps. If files are not renamed, this is a good way to detect changes. If they are renamed, timestamps alone wouldn't suffice to reliably tell one file from another. os.stat will tell you the time a file was last modified.
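
As a rough sketch of the timestamp approach, assuming files keep their names between scans: record each file's st_mtime in one scan and compare it with the next. The helper names snapshot_mtimes and changed_names are invented for this example, and how the snapshot is persisted between runs (pickle, JSON, a database) is left open.

```python
import os


def snapshot_mtimes(folder):
    """Map each regular file name in folder to its last-modified time."""
    result = {}
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            result[name] = os.stat(path).st_mtime
    return result


def changed_names(previous, current):
    """Names that are new, or whose mtime differs from the previous snapshot."""
    return sorted(name for name, mtime in current.items()
                  if previous.get(name) != mtime)
```

A renamed file shows up here as a new name, which is exactly the weakness of timestamps described above.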

You can look at inodes, e.g., ls -li. A file's inode number may change if changes involve creating a new file and removing the old one; this is how emacs typically changes files, for example. Try changing a file with the standard tool your organization uses, and compare inodes before and after; but bear in mind that even if it doesn't change this time, it might change under some circumstances. os.stat will tell you inode numbers.

You can look at the content of the files. cksum computes a small CRC checksum on a file; it's easy to beat if someone wants to. Programs such as sha256sum compute a secure hash; it's infeasible to change a file without changing such a hash. This can be slow if the files are large. The hashlib module will compute several kinds of secure hashes.
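
A minimal sketch of the hashing step with hashlib, reading in chunks so that large files are not loaded into memory at once. The function name sha256_of_file and the 1 MiB chunk size are arbitrary choices for illustration.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with the same digest can be treated as having the same contents, regardless of name, timestamp, or inode.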

If a file is renamed and changed, and its inode number changes, it would be potentially very difficult to match it up with the file it used to be, unless the data in the file contains some kind of immutable and unique identifier.

Think about concurrency. Is it possible that someone will be changing a file while the program runs? Beware of race conditions.

Upvotes: 1

Chris Johnson

Reputation: 22026

Monitoring for new files isn't hard -- just keep a list or database of inodes for all files in the directory. A new file will introduce a new inode. This will also help you avoid processing renamed files, since the inode doesn't change on a rename within the same filesystem.

The harder problem is monitoring for file changes. If you also store file size per inode, then obviously a changed size indicates a changed file and you don't need to open and process the file to know that. But for a file that has (a) a previously recorded inode, and (b) is the same size as before, you will need to process the file (e.g. compute a checksum) to know if it has changed.
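
One way this could look in code, as a sketch rather than a finished tool: keep a dictionary keyed by inode number that stores (size, digest), and report a file only when its inode is new or its recorded state differs. The names scan and file_digest are hypothetical. For simplicity this version hashes every file on every scan so the stored record stays current; a size mismatch already settles the question before the digest is compared. Note also that inode numbers are only unique within one filesystem and can be reused after a file is deleted.

```python
import hashlib
import os


def file_digest(path):
    """Hex SHA-256 of the file contents, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan(folder, previous):
    """Return (current, to_process).

    previous and current map inode number -> (size, digest).
    to_process lists paths that are new or whose contents changed.
    """
    current, to_process = {}, []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        info = os.stat(path)
        state = (info.st_size, file_digest(path))
        # Unknown inode, different size, or different digest: needs processing.
        if previous.get(info.st_ino) != state:
            to_process.append(path)
        current[info.st_ino] = state
    return current, to_process
```

Starting with an empty dictionary on the first run, each call returns the state to save for the next scan; a file that was only renamed keeps its inode and contents and is therefore skipped.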

Upvotes: 1

amitizle

Reputation: 971

I would probably go with some kind of SQLite solution, such as storing the time of the last poll. Then, on each poll, sort the files by last-modified time (mtime) and take all the ones whose mtime is greater than the time of your previous poll (this value is read from SQLite, or from a plain file if you'd rather not depend on a database).
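
A small sketch of that idea using the sqlite3 module from the standard library. The table name poll, the function name files_modified_since_last_poll, and the single-row layout are all assumptions made for this example. The poll time is taken before listing the folder, so a file modified during the scan may be reported twice but should not be missed.

```python
import os
import sqlite3
import time


def files_modified_since_last_poll(folder, db_path):
    """Return files in folder with mtime after the stored poll time, oldest first.

    The stored poll time is updated as a side effect.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS poll ("
            "id INTEGER PRIMARY KEY CHECK (id = 1), last_poll REAL)")
        row = conn.execute("SELECT last_poll FROM poll WHERE id = 1").fetchone()
        last_poll = row[0] if row else 0.0  # first run: everything counts as new
        started = time.time()
        changed = []
        for name in os.listdir(folder):
            path = os.path.join(folder, name)
            if os.path.isfile(path):
                mtime = os.path.getmtime(path)
                if mtime > last_poll:
                    changed.append((mtime, path))
        conn.execute(
            "INSERT OR REPLACE INTO poll (id, last_poll) VALUES (1, ?)",
            (started,))
        conn.commit()
    finally:
        conn.close()
    return [path for _mtime, path in sorted(changed)]
```

Keep the database file outside the monitored folder, otherwise it will report itself as modified on every poll.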

Upvotes: 1
