jonnywalkerr

Reputation: 157

Downloading latest files from remote server without downloading previous files

I am trying to download the latest files uploaded to a server via SFTP. After each file is downloaded, its metadata is stored in a table, including an MD5 checksum, a timestamp, the file name, etc.

The script runs as a cron job and wakes up a few times a day to fetch new files from a set of servers. Usually, the number of files is small, so it is easy to just download everything, hash the contents, and compare the result with what exists in the database to determine if a file is new.

However, we now have a server being accessed that does not purge any content. So, downloading and hashing everything is far too costly. It seems the only option is to assess the file's metadata remotely and use this to determine if the file is new.

One solution I thought might work was to compare the mtime or ctime of the remote files against the latest timestamp stored in the file table. The script would then only download files with an mtime or ctime greater than the latest recorded timestamp (sourced from the last download). However, mtime and ctime do not reflect the upload time. So, in the worst case, a file uploaded after the last cron run could carry an mtime or ctime earlier than the most recently recorded timestamp, and would be missed.

The other solution I have considered is treating the file name and timestamp as a sort of composite key and comparing these two attributes against entries in the file table. I am not sure whether this is a valid or safe approach. The file names are fairly unique, so maybe it would work. I am really looking for the safest bet in terms of avoiding missed files.

The script that does the actual access is written mostly using the phpseclib SFTP library.
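For reference, the remote metadata the approaches above rely on can be gathered with phpseclib's `rawlist()`, which returns a filename-to-attributes array. The sketch below (assuming phpseclib 3.x; the `normalizeListing` helper name and the `/upload` path are illustrative, not from the original) flattens that listing into the records one would compare against the file table:

```php
<?php
// Sketch, assuming phpseclib 3.x. The helper name and field choices
// are illustrative; only rawlist()'s filename => attributes shape
// and the SFTP protocol's type codes are relied on.

/**
 * Convert a phpseclib rawlist()-style array (filename => stat array)
 * into flat records of the metadata worth persisting.
 */
function normalizeListing(array $rawlist): array
{
    $records = [];
    foreach ($rawlist as $name => $attrs) {
        if ($name === '.' || $name === '..') {
            continue;
        }
        // Type 2 is a directory in the SFTP protocol; skip those.
        if (isset($attrs['type']) && $attrs['type'] === 2) {
            continue;
        }
        $records[] = [
            'name'  => $name,
            'size'  => $attrs['size'] ?? null,
            'mtime' => $attrs['mtime'] ?? null,
        ];
    }
    return $records;
}

// Live usage would look roughly like (untested sketch):
// use phpseclib3\Net\SFTP;
// $sftp = new SFTP('example.com');
// $sftp->login('user', 'password');
// $records = normalizeListing($sftp->rawlist('/upload'));
```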

I do not have ssh access, so a remote checksum is not possible.

Any insight would be greatly appreciated.

Thanks

Upvotes: 1

Views: 629

Answers (1)

Martin Prikryl

Reputation: 202177

You answered your question yourself.

Collect the file names, modification times, and sizes of the remote files and store them in the database. On the next run, collect the same metadata and compare it against the previous run. That's the best you can do.

It's highly unlikely that a file's contents change without its timestamp or size changing too.


The only exception would be calculating a remote file checksum. But phpseclib does not support that, and most SFTP servers (OpenSSH in particular) do not either.
See How to perform checksums during a SFTP file transfer for data integrity?

Upvotes: 1
