Reputation: 1043
I am trying to read a file from an FTP server. The file is a .gz
file. I would like to know if I can perform actions on this file while the socket is open. I tried to follow what was mentioned in two StackOverflow questions on reading files without writing to disk and reading files from FTP without downloading but was not successful.
I know how to extract data/work on the downloaded file but I'm not sure if I can do it on the fly. Is there a way to connect to the site, get data in a buffer, possibly do some data extraction and exit?
When trying StringIO I got the error:
>>> from ftplib import FTP
>>> from StringIO import StringIO
>>> ftp = FTP('ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz')
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
ftp = FTP('ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz')
File "C:\Python27\lib\ftplib.py", line 117, in __init__
self.connect(host)
File "C:\Python27\lib\ftplib.py", line 132, in connect
self.sock = socket.create_connection((self.host, self.port), self.timeout)
File "C:\Python27\lib\socket.py", line 553, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
gaierror: [Errno 11004] getaddrinfo failed
I just need to know how can I get data into some variable and loop on it until the file from FTP is read.
I appreciate your time and help. Thanks!
Upvotes: 14
Views: 25784
Reputation: 14144
Make sure to login to the ftp server first. After this, use retrbinary
which pulls the file in binary mode. It uses a callback on each chunk of the file. You can use this to load it into a string.
from ftplib import FTP
ftp = FTP('ftp.ncbi.nlm.nih.gov')
ftp.login() # Username: anonymous password: anonymous@
# Setup a cheap way to catch the data (could use StringIO too)
data = []
def handle_binary(more_data):
data.append(more_data)
resp = ftp.retrbinary("RETR pub/pmc/PMC-ids.csv.gz", callback=handle_binary)
data = "".join(data)
Bonus points: how about we decompress the string while we're at it?
Easy mode, using data string above
import gzip
import StringIO
zippy = gzip.GzipFile(fileobj=StringIO.StringIO(data))
uncompressed_data = zippy.read()
Little bit better, full solution:
from ftplib import FTP
import gzip
import StringIO
ftp = FTP('ftp.ncbi.nlm.nih.gov')
ftp.login() # Username: anonymous password: anonymous@
sio = StringIO.StringIO()
def handle_binary(more_data):
sio.write(more_data)
resp = ftp.retrbinary("RETR pub/pmc/PMC-ids.csv.gz", callback=handle_binary)
sio.seek(0) # Go back to the start
zippy = gzip.GzipFile(fileobj=sio)
uncompressed = zippy.read()
In reality, it would be much better to decompress on the fly but I don't see a way to do that with the built in libraries (at least not easily).
Upvotes: 30
Reputation: 5818
There are two easy ways I can think of to download a file using FTP and store it locally:
Using ftplib
:
from ftplib import FTP
ftp = FTP('ftp.ncbi.nlm.nih.gov')
ftp.login()
ftp.cwd('pub/pmc')
ftp.retrbinary('RETR PMC-ids.csv.gz', open('PMC-ids.csv.gz', 'wb').write)
ftp.quit()
Using urllib
from urllib import urlretrieve
urlretrieve("ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz", "PMC-ids.csv.gz")
If you don't want to download and store it to a file, but you want to process it gradually as it comes, I suggest using urllib2
:
from urllib2 import urlopen
u = urlopen("ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/readme.txt")
for line in u:
print line
which prints your file line by line.
Upvotes: 6
Reputation: 33126
That is not possible. To process data on the server, you need to have some sort of execution permissions, be it for a shell script you would send or SQL access.
FTP is pure file transfer, no execution allowed. You will need either to enable SSH access, load the data into a Database and access that with queries or download the file with urllib
then process it locally, like this:
import urllib
handle = urllib.urlopen('ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz')
# Use data, maybe: buffer = handle.read()
In particular, I think the third one is the only zero-effort solution.
Upvotes: 0