Sanchit

Reputation: 131

Read CSV/Excel files from an SFTP server, make some changes in those files using Pandas, and save back

I want to read some CSV/Excel files from a secure SFTP folder, make some changes to those files (the same fixed changes in each file, like removing column 2), upload them to a Postgres DB, and also upload them to a different SFTP path, all in Python.

What's the best way to do this?

I have made a connection to the SFTP server using the pysftp library and am reading the files:

import pysftp
import pandas as pd

myHostname = "*****"
myUsername = "****"
myPassword = "***8"
cnopts = pysftp.CnOpts()
cnopts.hostkeys = None

sftp = pysftp.Connection(host=myHostname, username=myUsername,
                         password=myPassword, cnopts=cnopts)
print("Connection successfully established ...")
sftp.chdir('test/test')

# Read each remote file into a DataFrame
for i in sftp.listdir_attr():
    with sftp.open(i.filename) as f:
        df = pd.read_csv(f)

How should I proceed with the upload to the DB, and how do I make those changes to the CSVs permanent?

Upvotes: 2

Views: 9660

Answers (2)

Martin Prikryl

Reputation: 202360

You have the download part done.

For the upload part, see How to Transfer Pandas DataFrame to .csv on SFTP using Paramiko Library in Python? – While that question is about Paramiko, the pysftp Connection.open method behaves identically to Paramiko SFTPClient.open, so the code is the same (though you should not use pysftp).

The full code can look like this:

with sftp.open("/remote/path/data.csv", "r+", bufsize=32768) as f:
    # Download CSV contents from SFTP to memory
    df = pd.read_csv(f)

    # Modify as you need (just an example)
    df.at[0, 'Name'] = 'changed'

    # Upload the in-memory data back to SFTP
    f.seek(0)
    df.to_csv(f, index=False)
    # Truncate the remote file in case the new version of the contents is smaller
    f.truncate(f.tell())

The above updates the same file. If you want to upload to a different file, use this:

# Download CSV contents from SFTP to memory
with sftp.open("/remote/path/source.csv", "r") as f:
    df = pd.read_csv(f)

# Modify as you need (just an example)
df.at[0, 'Name'] = 'changed'

# Upload the in-memory data back to SFTP
with sftp.open("/remote/path/target.csv", "w", bufsize=32768) as f:
    df.to_csv(f, index=False)

For the purpose of bufsize, see:
Writing to a file on SFTP server opened using Paramiko/pysftp "open" method is slow
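
Since pysftp is discouraged above, here is a minimal sketch of the same in-place update done with plain Paramiko instead; the hostname, credentials, and remote path are placeholder assumptions:

import paramiko
import pandas as pd

# Placeholders: replace the hostname, credentials and remote path with your own
ssh = paramiko.SSHClient()
ssh.load_system_host_keys()
ssh.connect("example.com", username="user", password="password")
sftp = ssh.open_sftp()

with sftp.open("/remote/path/data.csv", "r+", bufsize=32768) as f:
    # Same download-modify-upload cycle as the pysftp version above
    df = pd.read_csv(f)
    df.at[0, 'Name'] = 'changed'
    f.seek(0)
    df.to_csv(f, index=False)
    f.truncate(f.tell())

sftp.close()
ssh.close()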


Obligatory warning: Do not set cnopts.hostkeys = None, unless you do not care about security. For the correct solution see Verify host key with pysftp.
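
For completeness, a minimal sketch of the verified-host-key setup, assuming the server's key was saved to a local known_hosts file beforehand (the hostname and credentials are placeholders):

import pysftp

# Assumes the server's host key was collected beforehand, e.g. with:
#   ssh-keyscan example.com >> known_hosts
cnopts = pysftp.CnOpts(knownhosts='known_hosts')

with pysftp.Connection("example.com", username="user",
                       password="password", cnopts=cnopts) as sftp:
    print(sftp.listdir())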

Upvotes: 5

Kraxi

Reputation: 103

That's several questions in one question :)

I would suggest going with this approach (a rough sketch follows the list):

  1. Make a local copy of the file (not sure how big it is, but there is no point in shuffling it back and forth between your local machine and the SFTP server). You can use the get method.
  2. Make your operations on the data with pandas, then dump it back to CSV with the to_csv method.
  3. Load the data into Postgres using either pandas.io or pure SQLAlchemy. Check the docs here.
  4. Upload the file to the destination you want with the put method.
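
A minimal sketch of the four steps above, assuming the pysftp connection from the question; the host, credentials, paths, connection string, and table name are all placeholders:

import pandas as pd
import pysftp
from sqlalchemy import create_engine

cnopts = pysftp.CnOpts(knownhosts='known_hosts')  # see the host key note above

with pysftp.Connection("example.com", username="user",
                       password="password", cnopts=cnopts) as sftp:
    # 1. Make a local copy of the remote file
    sftp.get("test/test/data.csv", "data.csv")

    # 2. Modify the data with pandas and dump it back to CSV
    df = pd.read_csv("data.csv")
    df = df.drop(df.columns[1], axis=1)  # e.g. remove column 2
    df.to_csv("data.csv", index=False)

    # 3. Load the data into Postgres via SQLAlchemy
    engine = create_engine("postgresql://user:password@dbhost:5432/dbname")
    df.to_sql("my_table", engine, if_exists="replace", index=False)

    # 4. Upload the modified file to the destination SFTP path
    sftp.put("data.csv", "other/path/data.csv")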

Upvotes: 0
