Brad

Reputation: 589

Quickly downloading files using Python

I am currently trying to download files from more than 800,000 URLs. Each URL represents one .txt file.

I am using a DataFrame to store all the URL information:

index       Filename                                         
4           .../data/1000015/0001104659-05-006777.txt
5           .../data/1000015/0000950123-05-003812.txt
......

Code:

    import urllib

    for i in m.index:
        download = 'ftp://.../' + m['Filename'][i]   # build the remote URL for this row
        print download
        # save as "<Co_name><Date>.txt"
        urllib.urlretrieve(download, '%s%s.txt' % (m['Co_name'][i], m['Date'][i]))

This method works; however, it is quite slow: it downloads 15 files in 7 minutes. Given that I have more than 800,000 files, that would take more than 9 months. So I was wondering, could anyone help me improve this? Thank you so much.


After some really helpful comments, I made some changes. Is the following a good way to do multiprocessing?

Code:

    import os
    import ftplib
    import pandas as pd
    from multiprocessing import Pool

    def download(i):
        # one connection per call; fetch the single file for row i
        ftp = ftplib.FTP('XXXX')
        ftp.login()
        a = m['Filename'][i]
        local_file = os.path.join("local_folder", '%s %s.txt' % (m['Co_name'][i], m['Date'][i]))
        fhandle = open(local_file, 'wb')
        ftp.retrbinary('RETR ' + a, fhandle.write)
        fhandle.close()
        ftp.quit()

    m = pd.read_csv('XXXX.csv', delimiter=',', index_col='index')

    pool = Pool(10)
    pool.map(download, m.index)   # one worker call per row index

Upvotes: 1

Views: 817

Answers (1)

sweber

Reputation: 2976

This way, you establish a new connection for every file, which means you lose a few seconds per file during which nothing is downloaded.

You can reduce this by using ftplib (https://docs.python.org/2/library/ftplib.html), which lets you establish a single connection and retrieve the files one by one over it.
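As a rough sketch of that single-connection pattern (the host name and file paths here are placeholders, not taken from the question):

    import ftplib

    # open one connection and reuse it for every file
    ftp = ftplib.FTP('ftp.example.com')   # placeholder host
    ftp.login()

    for remote_path in ['/data/a.txt', '/data/b.txt']:   # placeholder paths
        local_name = remote_path.split('/')[-1]
        f = open(local_name, 'wb')
        ftp.retrbinary('RETR ' + remote_path, f.write)   # stream the file to disk
        f.close()

    ftp.quit()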

Still, there is time during which no data is transferred. To use the maximum bandwidth, use threads to download several files in parallel. But note that some servers limit the number of parallel connections.
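A minimal threaded sketch, assuming each worker keeps its own connection and takes every Nth file (host and file list are again placeholders):

    import ftplib
    import threading

    files = ['/data/a.txt', '/data/b.txt', '/data/c.txt', '/data/d.txt']   # placeholder list
    NUM_THREADS = 2   # keep this below the server's parallel-connection limit

    def worker(chunk):
        # each thread opens one connection and reuses it for its whole chunk
        ftp = ftplib.FTP('ftp.example.com')   # placeholder host
        ftp.login()
        for remote_path in chunk:
            f = open(remote_path.split('/')[-1], 'wb')
            ftp.retrbinary('RETR ' + remote_path, f.write)
            f.close()
        ftp.quit()

    threads = []
    for n in range(NUM_THREADS):
        t = threading.Thread(target=worker, args=(files[n::NUM_THREADS],))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()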

However, the connection overhead should not exceed a few seconds, let's say 5 in the worst case. Even then, about 25 s for a 100 kB file is very, very slow. I guess your connection is very slow, or the server is. If FTP is not the standard way of accessing your mainframe, maybe its FTP server is shut down when a connection is terminated and started again when a new one is established? Then ftplib should help. Still, an overhead of half a second per file means 400,000 seconds of waiting, so downloading in parallel makes sense.
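For a quick sanity check of those numbers, using the 15-files-in-7-minutes figure from the question:

    files = 800000
    per_file = 7 * 60 / 15.0           # ~28 s per file at the observed rate
    print per_file                     # 28.0
    print files * per_file / 86400.0   # ~259 days, i.e. roughly 9 months

    # even a 0.5 s connection overhead per file adds up:
    print files * 0.5                  # 400000 s of pure waiting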

Maybe you should first try an FTP client like FileZilla and check what bandwidth is possible with it.

Upvotes: 1
