michal-ko

Reputation: 461

Python Paramiko directory walk over SFTP

How can I do the equivalent of os.walk() on another computer through SSH? The problem is that os.walk() executes on the local machine, and I want to SSH to another host, walk through a directory, and generate MD5 hashes for every file within it.

What I have written so far is below, but it doesn't work. Any help would be greatly appreciated.

try:
    hash_array = []
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect('sunbeam', port=22, username='xxxx', password='filmlight')

    spinner.start()
    for root, dirs, files in os.walk(_path):
        for file in files:
            file_path = os.path.join(os.path.abspath(root), file)
            
            #  generate hash code for file
            hash_array.append(genMD5hash(file_path))
            
            file_nb += 1
    spinner.stop()
    spinner.ok('Finished.')

    return hash_array
except Exception as e:
    print(e)
    return None
finally:
    ssh.close() 

Upvotes: 6

Views: 6133

Answers (3)

Martin Prikryl

Reputation: 202168

To recursively list a directory with Paramiko over SFTP, the standard file access interface, you need to implement a recursive function using SFTPClient.listdir_attr:

from stat import S_ISDIR, S_ISREG

def listdir_r(sftp, remotedir):
    for entry in sftp.listdir_attr(remotedir):
        remotepath = remotedir + "/" + entry.filename
        mode = entry.st_mode
        if S_ISDIR(mode):
            listdir_r(sftp, remotepath)
        elif S_ISREG(mode):
            print(remotepath)

Based on "Python pysftp get_r from Linux works fine on Linux but not on Windows".


Alternatively, pysftp implements an os.walk equivalent: Connection.walktree. However, pysftp is no longer maintained, so it is better to use it only as inspiration for your own implementation rather than directly.


But you will have trouble getting the MD5 of a remote file with the SFTP protocol.

While Paramiko supports it with its SFTPFile.check, most SFTP servers (particularly the most widespread SFTP/SSH server – OpenSSH) do not. See:
How to check if Paramiko successfully uploaded a file to an SFTP server? and
How to perform checksums during a SFTP file transfer for data integrity?
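
If shell access is not available, one possible fallback is to try SFTPFile.check first and, when the server rejects it, stream the file over SFTP and hash it locally. The following is only a sketch under that assumption: `remote_md5` and `chunk_size` are hypothetical names, not part of Paramiko, and `sftp` is expected to be an already opened `paramiko.SFTPClient`.

```python
import hashlib


def remote_md5(sftp, remotepath, chunk_size=32768):
    """Return the hex MD5 of a remote file (hypothetical helper)."""
    with sftp.open(remotepath, "rb") as remote_file:
        try:
            # Server-side hashing; needs the "check-file" SFTP extension,
            # which many servers (OpenSSH included) do not implement.
            return remote_file.check("md5").hex()
        except IOError:
            pass
        # Fallback: read the file in chunks and hash it on the client.
        digest = hashlib.md5()
        while True:
            chunk = remote_file.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
        return digest.hexdigest()
```

Note that the fallback transfers the whole file contents, so it is slow for large trees.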

So you will most probably have to resort to using the shell md5sum command (if you have shell access at all). And once you have to use the shell anyway, consider listing the files with the shell too, as that will be orders of magnitude faster than via SFTP.

See md5 all files in a directory tree.

Use SSHClient.exec_command:
Comparing MD5 of downloaded files against files on an SFTP server in Python
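
The shell-based approach could look roughly like the sketch below. It assumes the remote host has a POSIX shell with `find` and `md5sum`, and that file names contain no newlines or backslashes (which `md5sum` escapes in its output). `remote_md5sums` and `parse_md5sum_output` are hypothetical helper names; `ssh` is expected to be a connected `paramiko.SSHClient`.

```python
import shlex


def parse_md5sum_output(text):
    """Turn md5sum output lines ("<hash>  <path>") into a {path: hash} dict."""
    hashes = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, _, rest = line.partition(" ")
        # After the digest comes one mode character (' ' or '*'), then the path.
        hashes[rest[1:]] = digest
    return hashes


def remote_md5sums(ssh, remotedir):
    """Hash every regular file under remotedir with the remote md5sum."""
    command = "find {} -type f -exec md5sum {{}} +".format(shlex.quote(remotedir))
    stdin, stdout, stderr = ssh.exec_command(command)
    output = stdout.read().decode("utf-8", errors="replace")
    if stdout.channel.recv_exit_status() != 0:
        raise RuntimeError(stderr.read().decode("utf-8", errors="replace"))
    return parse_md5sum_output(output)
```

A single `exec_command` call replaces one SFTP round trip per directory plus one transfer per file, which is where the speed difference comes from.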


Obligatory warning: Do not use AutoAddPolicy. By doing so, you lose protection against MITM attacks. For a correct solution, see Paramiko "Unknown Server".

Upvotes: 10

siikamiika

Reputation: 276

Here's another implementation that tries to mimic the os.walk (python/cpython/os.py:walk) function in Python so that it can be used as a drop-in replacement for code that expects the built-in function.

import stat

import paramiko


class ParamikoWalkExample:
    def __init__(self, host, username=None):
        self._host = host
        self._username = username
        self._ssh = self._ssh_connect()
        self._sftp = self._ssh.open_sftp()

    def _ssh_connect(self):
        ssh = paramiko.SSHClient()
        ssh.load_system_host_keys()
        ssh.connect(self._host, username=self._username)
        return ssh

    def walk(
        self,
        top,
        topdown=True,
        onerror=None, # ignored
        followlinks=False,
    ):
        stack = [top]

        while stack:
            top = stack.pop()
            if isinstance(top, tuple):
                yield top
                continue

            dirs = []
            nondirs = []
            walk_dirs = []

            for entry in self._sftp.listdir_attr(top):
                if entry.st_mode is None:
                    nondirs.append(entry.filename)
                elif stat.S_ISDIR(entry.st_mode):
                    dirs.append(entry.filename)
                    walk_dirs.append(top + '/' + entry.filename)
                elif stat.S_ISREG(entry.st_mode):
                    nondirs.append(entry.filename)
                elif stat.S_ISLNK(entry.st_mode):
                    # SFTPClient.stat() follows symlinks, so one call on the
                    # full path resolves the whole chain. A broken link
                    # raises IOError and is reported as a non-directory.
                    path = top + '/' + entry.filename
                    try:
                        target_mode = self._sftp.stat(path).st_mode
                    except IOError:
                        target_mode = None

                    if target_mode is not None and stat.S_ISDIR(target_mode):
                        dirs.append(entry.filename)
                        if followlinks:
                            walk_dirs.append(path)
                    else:
                        nondirs.append(entry.filename)

            if topdown:
                yield top, dirs, nondirs
                for new_path in reversed(walk_dirs):
                    stack.append(new_path)
            else:
                # Yield after sub-directory traversal if going bottom up
                stack.append((top, dirs, nondirs))
                for new_path in reversed(walk_dirs):
                    stack.append(new_path)

Side note: the built-in function was recently rewritten on the main branch to use a stack instead of recursion, due to recursion errors in deep file hierarchies: https://github.com/python/cpython/issues/89727
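
To illustrate the drop-in idea, here is a small sketch of a consumer that accepts any os.walk-compatible callable, so the same code can run against `os.walk` locally or against the `walk` method of the class above. `iter_file_paths` is a hypothetical helper name, and the paths are joined with `posixpath` on the assumption that the remote side uses forward slashes.

```python
import posixpath


def iter_file_paths(walk, top):
    """Yield the full path of every file that walk(top) reports."""
    for root, dirs, files in walk(top):
        for name in files:
            yield posixpath.join(root, name)


# Hypothetical usage against a remote host:
#   walker = ParamikoWalkExample('example-host', username='user')
#   for path in iter_file_paths(walker.walk, '/remote/dir'):
#       print(path)
```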

Upvotes: 1

Alfonso Sancho

Reputation: 136

Based on the previous answer, here is a version that does not require recursion and returns a list of paths instead of printing them.

from stat import S_ISDIR, S_ISREG
from collections import deque

def listdir_r(sftp, remotedir):
    dirs_to_explore = deque([remotedir])
    list_of_files = deque([])

    while len(dirs_to_explore) > 0:
        current_dir = dirs_to_explore.popleft()

        for entry in sftp.listdir_attr(current_dir):
            current_fileordir = current_dir + "/" + entry.filename

            if S_ISDIR(entry.st_mode):
                dirs_to_explore.append(current_fileordir)
            elif S_ISREG(entry.st_mode):
                list_of_files.append(current_fileordir)

    return list(list_of_files)

Upvotes: 2
