Basj

Reputation: 46483

Slow upload of many small files with SFTP

When uploading 100 files of 100 bytes each over SFTP, it takes 17 seconds here (measured after the connection is established, so the initial connection time is not counted). That is 17 seconds to transfer only 10 KB, i.e. 0.59 KB/sec!

I know that sending SSH commands to open, write, close, etc. probably creates a big overhead, but still, is there a way to speed up the process when sending many small files with SFTP?

Or is there a special mode in paramiko / pysftp that keeps all pending write operations in a memory buffer (say, all operations from the last 2 seconds) and then performs them in one grouped SSH/SFTP pass? This would avoid waiting for a network round trip between operations.

Code used for the measurement:

import pysftp, time, os
with pysftp.Connection('1.2.3.4', username='root', password='') as sftp:
    with sftp.cd('/tmp/'):
        t0 = time.time()
        for i in range(100):
            print(i)
            with sftp.open('test%i.txt' % i, 'wb') as f:   # even worse in a+ append mode: it takes 25 seconds
                f.write(os.urandom(100))
        print(time.time() - t0)
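
To put the 17 seconds above in perspective, here is a back-of-the-envelope latency model. It is only a sketch: it assumes three sequential request/response exchanges per file (open, write, close), which is a guess and depends on the client library, and it ignores bandwidth and server processing time. The function name and parameters are hypothetical.

```python
def estimate_seconds(files, round_trips_per_file, rtt_seconds, pipelined=False):
    """Rough transfer time when latency, not bandwidth, dominates."""
    if pipelined:
        # All requests of one kind are sent back to back, so each kind of
        # request costs roughly one round trip in total.
        return round_trips_per_file * rtt_seconds
    # Otherwise every exchange waits for its response before the next starts.
    return files * round_trips_per_file * rtt_seconds


# Round-trip time implied by 100 files in 17 s at 3 exchanges per file (assumed).
implied_rtt = 17 / (100 * 3)
sequential = estimate_seconds(100, 3, implied_rtt)
batched = estimate_seconds(100, 3, implied_rtt, pipelined=True)
print(f"implied RTT: {implied_rtt * 1000:.0f} ms")
print(f"sequential: {sequential:.2f} s, pipelined: {batched:.2f} s")
```

Under these assumptions the payload size barely matters; almost all of the time is spent waiting for responses.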

Upvotes: 3

Views: 4216

Answers (2)

Martin Prikryl

Reputation: 202272

I'd suggest parallelizing the upload using multiple connections from multiple threads. That's an easy and reliable solution.
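
A minimal sketch of that multi-connection approach, assuming paramiko and password authentication. The helper names (split_round_robin, upload_batch, parallel_upload) and the default of 8 workers are made up for illustration, and the server must allow that many simultaneous sessions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_round_robin(items, n):
    """Deal items into n lists so each worker gets a similar share."""
    return [items[k::n] for k in range(n)]


def upload_batch(host, username, password, pairs):
    """Upload (local, remote) path pairs over one dedicated connection."""
    import paramiko  # imported here so the helpers load without paramiko installed

    ssh = paramiko.SSHClient()
    ssh.load_system_host_keys()
    ssh.connect(host, username=username, password=password)
    try:
        sftp = ssh.open_sftp()
        for local, remote in pairs:
            sftp.put(local, remote)
    finally:
        ssh.close()
    return len(pairs)


def parallel_upload(host, username, password, pairs, workers=8):
    """Run one upload_batch per thread; return the number of files sent."""
    batches = [b for b in split_round_robin(pairs, workers) if b]
    with ThreadPoolExecutor(max_workers=len(batches) or 1) as pool:
        futures = [
            pool.submit(upload_batch, host, username, password, b)
            for b in batches
        ]
        return sum(f.result() for f in futures)
```

Each thread owns its connection, so no SFTPClient object is shared between threads. A call would look like parallel_upload('1.2.3.4', 'root', '', [('0.dat', '/remote/path/0.dat')]).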


If you want to do it the hard way, by buffering the requests, you can base your solution on the following naive example.

The example:

  • Queues 100 file open requests;
  • As it reads the responses to the open requests, it queues write requests;
  • As it reads the responses to the write requests, it queues close requests.

If I do a plain SFTPClient.put for 100 files, it takes about 10-12 seconds. With the code below, the same upload completes about 50-100 times faster.

But! The code is really naive:

  • It expects that the server responds to the requests in the same order. Indeed, the majority of SFTP servers (including the de facto standard OpenSSH) respond in the same order. But according to the SFTP specification, an SFTP server is free to respond in any order.
  • The code expects that each local file can be read in a single call, upload.localhandle.read(32*1024). That's true for small files only.
  • The code expects that the SFTP server can handle 100 parallel requests and 100 open files. That's not a problem for most servers, as they process the requests in order. And 100 open files should not be a problem for a regular server.
  • You cannot do that for an unlimited number of files, though. You have to queue the files somehow to keep the number of outstanding requests in check. Actually, even these 100 requests are too many.
  • The code uses non-public methods of the SFTPClient class.
  • I do not do Python. There are definitely ways to code this more elegantly.
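
import paramiko
import paramiko.sftp

# `long` marks 64-bit fields (such as the write offset) for _async_request.
# paramiko.py3compat was removed in paramiko 3.0; this assumes that
# paramiko.sftp.int64 fills the same role there.
try:
    from paramiko.py3compat import long
except ImportError:
    from paramiko.sftp import int64 as long

ssh = paramiko.SSHClient()
ssh.connect(...)

sftp = ssh.open_sftp()
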
class Upload:
   def __init__(self):
       pass

uploads = []

for i in range(0, 100):
    print(f"sending open request {i}")
    upload = Upload()
    upload.i = i
    upload.localhandle = open(f"{i}.dat", "rb")  # binary mode: SFTP sends raw bytes
    upload.remotepath = f"/remote/path/{i}.dat"
    imode = \
        paramiko.sftp.SFTP_FLAG_CREATE | paramiko.sftp.SFTP_FLAG_TRUNC | \
        paramiko.sftp.SFTP_FLAG_WRITE
    attrblock = paramiko.SFTPAttributes()
    upload.request = \
        sftp._async_request(type(None), paramiko.sftp.CMD_OPEN, upload.remotepath, \
            imode, attrblock)
    uploads.append(upload)

for upload in uploads:
    print(f"reading open response {upload.i}")
    t, msg = sftp._read_response(upload.request)
    if t != paramiko.sftp.CMD_HANDLE:
        raise paramiko.SFTPError("Expected handle")
    upload.handle = msg.get_binary()

    print(f"sending write request {upload.i} to handle {upload.handle}")
    data = upload.localhandle.read(32*1024)
    upload.localhandle.close()  # the whole (small) file has been read
    upload.request = \
        sftp._async_request(type(None), paramiko.sftp.CMD_WRITE, \
            upload.handle, long(0), data)

for upload in uploads:
    print(f"reading write response {upload.i} {upload.request}")
    t, msg = sftp._read_response(upload.request)
    if t != paramiko.sftp.CMD_STATUS:
        raise paramiko.SFTPError("Expected status")
    print(f"closing {upload.i} {upload.handle}")
    upload.request = \
        sftp._async_request(type(None), paramiko.sftp.CMD_CLOSE, upload.handle)

for upload in uploads:
    print(f"reading close response {upload.i} {upload.request}")
    sftp._read_response(upload.request)
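
The caveat about keeping the number of outstanding requests in check could be handled with a sliding window. The helper below is a generic sketch, not part of paramiko: pipelined and its send / receive callables are hypothetical names. It still assumes that responses arrive in request order, like the example above.

```python
from collections import deque


def pipelined(items, send, receive, window=16):
    """Send a request per item, keeping at most `window` of them unanswered.

    send(item) must return a request token without waiting for the reply;
    receive(token) blocks until the matching response has been read.
    Yields (item, response) pairs in the order the items were given.
    """
    pending = deque()
    for item in items:
        if len(pending) >= window:
            # Window is full: read the oldest response before sending more.
            sent_item, token = pending.popleft()
            yield sent_item, receive(token)
        pending.append((item, send(item)))
    # Drain whatever is still outstanding.
    while pending:
        sent_item, token = pending.popleft()
        yield sent_item, receive(token)
```

With the code above, send could wrap the sftp._async_request(...) call for one phase (open, write or close) and receive could be sftp._read_response, so each of the three loops would run through pipelined instead of queueing all 100 requests at once.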

Upvotes: 3

Basj

Reputation: 46483

With the following method (100 asynchronous tasks), it's done in about 0.5 seconds, which is a massive improvement.

import asyncio, asyncssh  # pip install asyncssh
async def main():
    async with asyncssh.connect('1.2.3.4', username='root', password='') as conn:
        async with conn.start_sftp_client() as sftp:
            print('connected')
            # asyncio.wait() rejects bare coroutines since Python 3.11; gather() schedules them itself
            await asyncio.gather(*(sftp.put('files/test%i.txt' % i) for i in range(100)))
asyncio.run(main())

I'll explore the source, but I still don't know whether it groups many operations into a few SSH transactions or just runs the commands in parallel.
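
Whichever of the two it turns out to be, the number of simultaneous transfers can be capped explicitly if the server struggles with 100 at once. A small sketch using asyncio.Semaphore; put_all and the limit of 32 are made up for illustration, and put can be any coroutine function, such as sftp.put.

```python
import asyncio


async def put_all(put, paths, limit=32):
    """Await put(path) for every path, with at most `limit` calls in flight."""
    gate = asyncio.Semaphore(limit)

    async def one(path):
        # Wait here while `limit` transfers are already running.
        async with gate:
            return await put(path)

    # gather() returns the results in the same order as `paths`.
    return await asyncio.gather(*(one(p) for p in paths))
```

Inside main() above this would be called as await put_all(sftp.put, ['files/test%i.txt' % i for i in range(100)]).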

Upvotes: 6
