Yash Mathur

Reputation: 41

How do I count the number of lines in a file on an FTP server without downloading it locally, using Python?

I need to read a file on an FTP server and count its lines WITHOUT downloading it to my local machine, using Python.

I know the code to connect to the server:

import ftplib

ftp = ftplib.FTP('example.com')  # connect to the server
ftp.login('username', 'password')  # log in
ftp.retrlines('LIST')  # list the directory contents
ftp.cwd('/parent folder/another folder/file/')  # change the working directory

I also know the basic code to count the number of lines if the file is already downloaded/stored locally:

with open('file') as f:
    count = sum(1 for line in f)
    print(count)

I just need to know how to connect these two pieces of code without having to download the file to my local system.

Any help is appreciated. Thank you.

Upvotes: 1

Views: 1708

Answers (2)


Reputation: 486

There is a way: I adapted a piece of code that I wrote to process CSV files "on the fly". It is implemented with a producer-consumer approach. This pattern lets us assign each task to a thread (or process) and show partial results for huge remote files. You can adapt it for FTP requests.

The download stream is stored in a queue and consumed "on the fly". The file is still transferred over the network, but it is never written to disk, so no extra HDD space is needed and the approach is memory efficient. Tested with Python 3.5.2 (vanilla) on Fedora Core 25 x86_64.

This is the source adapted for retrieving an FTP file (served over HTTP):

from threading import Thread, Event
from queue import Queue, Empty
import argparse
import time
import urllib.request

FILE_URL = 'http://cdiac.ornl.gov/ftp/ndp030/CSV-FILES/nation.1751_2010.csv'

def download_task(url,chunk_queue,event):

    CHUNK = 1*1024
    response = urllib.request.urlopen(url)

    print('%% - Starting Download  - %%')
    print('%% - ------------------ - %%')
    '''VT100 control codes.'''
    CURSOR_UP_ONE = '\x1b[1A'
    ERASE_LINE = '\x1b[2K'
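    while True:
        chunk = response.read(CHUNK)
        if not chunk:
            print('%% - Download completed - %%')
            event.set()  # tell the consumer that no more chunks will arrive
            break
        chunk_queue.put(chunk)  # hand the chunk over to the consumer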

def count_task(chunk_queue, event):
    part = False
    linepart = b''
    M = 0  # number of complete lines counted so far
    time.sleep(5)  # give the producer some time
    contador = 0  # number of times the consumer found the queue empty
    # VT100 control codes.
    CURSOR_UP_ONE = '\x1b[1A'
    ERASE_LINE = '\x1b[2K'
    while True:
        try:
            # By default queue.get() blocks while the queue is empty.
            # With block=False it raises queue.Empty instead, which is used
            # here to show the partial result of the process.
            chunk = chunk_queue.get(block=False)
            for line in chunk.splitlines(True):
                if line.endswith(b'\n'):
                    if part:  # complete the partial line left by the previous chunk
                        line = linepart + line
                        part = False
                    M += 1
                else:
                    # A line without '\n' is the last line of the chunk: a partial
                    # line that is completed in the next iteration, by the next chunk.
                    part = True
                    linepart = line
        except Empty:
            # QUEUE EMPTY
            print('Downloading records ...')
            if M > 0:
                print('Partial result:  Lines: %d ' % M)  # M includes the header line
            # THE END: the download has finished (event is set) and the queue is drained
            if event.is_set() and chunk_queue.empty():
                print(CURSOR_UP_ONE + ERASE_LINE + CURSOR_UP_ONE)
                print(CURSOR_UP_ONE + ERASE_LINE + CURSOR_UP_ONE)
                print(CURSOR_UP_ONE + ERASE_LINE + CURSOR_UP_ONE)
                print('The consumer has waited %s times' % str(contador))
                print('RECORDS = ', M)
                break
            contador += 1
            time.sleep(1)  # give some time for more records to load

def main():

    chunk_queue = Queue()
    event = Event()
    args = parse_args()
    url = args.url

    p1 = Thread(target=download_task, args=(url, chunk_queue, event))
    p2 = Thread(target=count_task, args=(chunk_queue, event))
    p1.start()
    p2.start()
    p1.join()
    p2.join()

# The user of this module can customize one parameter:
#   + URL where the remote file can be found.

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('-u', '--url', default=FILE_URL,
                        help='remote-csv-file URL')
    return parser.parse_args()

if __name__ == '__main__':
    main()

$ python ftp-data.py -u <ftp-file>


python ftp-data-ol.py -u 'http://cdiac.ornl.gov/ftp/ndp030/CSV-FILES/nation.1751_2010.csv' 
The consumer has waited 0 times
RECORDS =  16327

CSV version on GitHub: https://github.com/AALVAREZG/csv-data-onthefly
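
The producer-consumer idea described in this answer can also be reduced to a much smaller sketch. This is not the code from the answer: it swaps the polling loop and the Event for a blocking `Queue.get()` plus a sentinel object, and it uses a bounded queue so that only a limited number of chunks is held in memory at once. The names `produce`, `consume`, `count_lines_streaming`, `feed` and `_DONE` are hypothetical, chosen only for illustration. `feed` is assumed to be any function that accepts a callback and calls it once per downloaded chunk of bytes.

```python
from queue import Queue
from threading import Thread

_DONE = object()  # sentinel the producer puts on the queue when it is finished


def produce(feed, chunk_queue):
    """Run feed(callback) and push every chunk it delivers onto the queue."""
    try:
        feed(chunk_queue.put)
    finally:
        chunk_queue.put(_DONE)  # always wake the consumer, even after an error


def consume(chunk_queue, result):
    """Count the lines contained in the byte chunks taken from the queue."""
    lines = 0
    last_byte = b'\n'
    while True:
        chunk = chunk_queue.get()  # blocks until a chunk or the sentinel arrives
        if chunk is _DONE:
            break
        if chunk:
            lines += chunk.count(b'\n')
            last_byte = chunk[-1:]
    if last_byte != b'\n':
        lines += 1  # the final line had no trailing newline
    result['lines'] = lines


def count_lines_streaming(feed, max_chunks=64):
    """Count lines while the data is still arriving; nothing is written to disk."""
    chunk_queue = Queue(maxsize=max_chunks)  # bounded, so memory use stays small
    result = {}
    producer = Thread(target=produce, args=(feed, chunk_queue))
    consumer = Thread(target=consume, args=(chunk_queue, result))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return result['lines']
```

With a logged-in `ftplib.FTP` object named `ftp`, the feed could be `lambda callback: ftp.retrbinary('RETR ' + path, callback)`, where `path` is a placeholder for the remote file name. If the feed raises, the sketch returns the count reached so far, so a real script would want to report that error explicitly.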

Upvotes: 0

As far as I know, FTP doesn't provide any way to read a file's content without actually downloading it. However, you could try something like the approach in "Is it possible to read FTP files without writing them using Python?" (You haven't specified which Python version you are using.)

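#!/usr/bin/env python
from ftplib import FTP

newlines = 0

def count_lines(block):
    # retrbinary calls this once per downloaded block (bytes), so accumulate
    global newlines
    newlines += block.count(b'\n')

ftp = FTP('ftp.kernel.org')
ftp.login()  # anonymous login
ftp.retrbinary('RETR /pub/README_ABOUT_BZ2_FILES', count_lines)
print(newlines)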

Please take this code as a reference only.
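
For a plain text file, a variant that joins the two snippets from the question more directly is `ftplib`'s `retrlines`, which calls a callback once per line, so the callback only has to increment a counter. The following is a minimal sketch under that assumption; `count_remote_lines` is a hypothetical helper name, and the host, credentials and path in the comments are placeholders. The data is still streamed over the network, but each line is discarded after it is counted and nothing is stored on disk.

```python
from ftplib import FTP


def count_remote_lines(ftp, remote_path):
    """Count the lines of a remote text file by streaming it with RETR.

    `ftp` is any object with an ftplib-style retrlines(cmd, callback) method,
    for example a connected and logged-in ftplib.FTP instance.
    """
    count = 0

    def on_line(_line):
        # retrlines passes each line here; it is counted and then dropped
        nonlocal count
        count += 1

    ftp.retrlines('RETR ' + remote_path, on_line)
    return count


# Example use (placeholder host, credentials and path):
#
#     with FTP('example.com') as ftp:
#         ftp.login('username', 'password')
#         print(count_remote_lines(ftp, '/parent folder/another folder/file.txt'))
```

`retrlines` transfers in text mode, so for binary files or files with very long lines the `retrbinary` approach of counting newline bytes is likely the safer choice.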

Upvotes: 1
