trench

Reputation: 5355

Speeding up code to download files from an SFTP server

I wanted to check if there is a faster way to download files from an SFTP server that don't already exist in a folder on my local machine. The issue is that these files are 5-minute interval snapshots, and the current SFTP folders have thousands of them (literally one every 5 minutes since at least August 2016).

I plan on asking the client if they can clean up the SFTP server and put a process in place for removing the older data, but in the meantime I would like to improve my download code as well.

Essentially, I check each folder on the SFTP server and then check the corresponding folder on my computer. If a file doesn't exist locally, I download it (I am using Windows 10). Even just listing all of the files and checking whether they exist takes a long time (1400 seconds for one of the folders alone, which means I can't possibly run this every 5 minutes).

with pysftp.Connection(host, username, password, port, cnopts) as sftp:
    logger.info('Server connected')
    for folder in folders:
        sftp.chdir(folder)
        logger.info('Downloading data from the {} folder'.format(folder))
        for file in sftp.listdir():
            if file not in os.listdir(os.path.join(path, folder.lower())) and sftp.isfile(file):
                logger.info('Downloading: {}'.format(file))
                os.chdir(os.path.join(path, folder.lower()))
                sftp.get(file, preserve_mtime=True)

Here is the exact filename structure for one of the folders:

filename-2016-12-06-08-55-05-to-09-00-17.csv

This one folder (out of 7 folders) holds 30,000 files (only 129 MB of data).

Upvotes: 1

Views: 2038

Answers (1)

dorian

Reputation: 6272

I'm afraid it's going to be difficult to make this script significantly faster, as paramiko is not known for being blazingly fast. If at all possible, this seems more like a job for rsync or the like. If there's no rsync on the remote host, you could still try to mount the remote file system over SFTP (e.g. with sshfs) and run rsync locally.

Having said that, one thing I noticed is that the expression os.listdir(os.path.join(path, folder.lower())) is evaluated for every remote file, even though its result only changes once per iteration of the outermost loop. You could build that listing once per folder (ideally as a set, so each membership test is O(1)) and re-use it. I doubt it will make much of a difference, however.
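A minimal sketch of that hoisting, factored into a helper so it can be tested offline; the function name files_to_download is hypothetical, and in your script you would call it once per folder with the result of sftp.listdir():

```python
import os


def files_to_download(remote_files, local_dir):
    """Return the remote file names not yet present in local_dir.

    The local directory is listed exactly once and kept in a set,
    so each membership test is an O(1) hash lookup instead of
    re-listing the directory for every remote file.
    """
    local_files = set(os.listdir(local_dir))  # built once per folder
    return [name for name in remote_files if name not in local_files]
```

In your loop, that would replace the per-file os.listdir call: compute the set before `for file in sftp.listdir():` and test membership against it. The sftp.isfile check still costs a round trip per file, so the listing fix alone won't remove all of the latency.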

Upvotes: 1

Related Questions