Reputation: 589
I am currently trying to download files from more than 800,000 URLs. Each URL represents one .txt file.
I am using a dataframe to store all the URL information:
index Filename
4 .../data/1000015/0001104659-05-006777.txt
5 .../data/1000015/0000950123-05-003812.txt
......
Code:
for i in m.index:
    download = 'ftp:/.../' + m['Filename'][i]
    print download
    urllib.urlretrieve(download, '%s''%s.txt' % (m['Co_name'][i], m['Date'][i]))
This method works, but it is very slow: it downloads 15 files in 7 minutes. With more than 800,000 files, that works out to roughly nine months. Could anyone help me improve this? Thank you so much.
After some really helpful comments, I made some changes. Is the following a good way to do multiprocessing?
Code:
def download(file):
    import ftplib
    ftp = ftplib.FTP('XXXX')
    ftp.login()
    for i in m.index:
        a = m['Filename'][i]
        local_file = os.path.join("local_folder", '%s %s.txt' % (m['Co_name'][i], m['Data'][i]))
        fhandle = open(local_file,'wb')
        print fhandle
        ftp.retrbinary('RETR '+a, fhandle.write)
        fhandle.close()

m=pd.read_csv('XXXX.csv', delimiter=',', index_col='index')
pool = Pool(10)
pool.map(download, m)
Upvotes: 1
Views: 817
Reputation: 2976
With this approach you establish a new connection for every file, so you lose a few seconds per file during which nothing is downloaded.
You can reduce this by using ftplib (https://docs.python.org/2/library/ftplib.html), which lets you establish a single connection and retrieve the files one by one over that connection.
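As a rough illustration of the single-connection idea, here is a minimal sketch, not a drop-in replacement for your loop. It is written for Python 3, the host `ftp.example.com` and the function names `download_all` and `open_anonymous` are placeholders I made up, and it assumes the remote base names are unique so that local files do not overwrite each other.

```python
import ftplib
import os


def download_all(ftp, remote_paths, local_dir):
    """Fetch every remote path over one already-open FTP connection."""
    saved = []
    for remote in remote_paths:
        # Assumption: base names are unique, so they are safe as local names.
        local = os.path.join(local_dir, os.path.basename(remote))
        with open(local, "wb") as fh:
            ftp.retrbinary("RETR " + remote, fh.write)
        saved.append(local)
    return saved


def open_anonymous(host="ftp.example.com"):
    """Open one anonymous connection (hypothetical host name)."""
    ftp = ftplib.FTP(host)
    ftp.login()
    return ftp
```

You would call `open_anonymous()` once, pass the result to `download_all` together with the values of your `Filename` column, and call `ftp.quit()` at the end, so the login cost is paid once instead of 800,000 times.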
Even then, there are periods in which no data is transferred. To use the maximum bandwidth, use threads to download several files in parallel. Note, however, that some servers limit the number of parallel connections.
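One possible way to combine both ideas is to give each thread its own connection and let it work through its share of the file list. The following is only a sketch under assumptions: it uses Python 3's `concurrent.futures`, the names `fetch_chunk`, `parallel_download` and `connect_anonymous` and the host `ftp.example.com` are invented for illustration, and the default of 4 workers is arbitrary and should stay below whatever limit the server enforces.

```python
import ftplib
import os
from concurrent.futures import ThreadPoolExecutor


def fetch_chunk(connect, remote_paths, local_dir):
    """Download one share of the files over a single, private connection."""
    ftp = connect()
    done = []
    try:
        for remote in remote_paths:
            # Assumption: base names are unique across the whole file list.
            local = os.path.join(local_dir, os.path.basename(remote))
            with open(local, "wb") as fh:
                ftp.retrbinary("RETR " + remote, fh.write)
            done.append(local)
    finally:
        ftp.quit()
    return done


def parallel_download(connect, remote_paths, local_dir, workers=4):
    """Split the paths into `workers` shares and fetch them in threads."""
    chunks = [remote_paths[i::workers] for i in range(workers)]
    chunks = [c for c in chunks if c]  # drop empty shares
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda c: fetch_chunk(connect, c, local_dir), chunks)
        return [path for chunk in results for path in chunk]


def connect_anonymous():
    """Open one anonymous connection (hypothetical host name)."""
    ftp = ftplib.FTP("ftp.example.com")
    ftp.login()
    return ftp
```

A call such as `parallel_download(connect_anonymous, list(m['Filename']), "local_folder")` would then open at most four connections in total. Threads are sufficient here because the work is waiting on the network, not computing.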
However, the connection overhead should not exceed a few seconds per file, let's say 5 in the worst case. Against that, about 25 s for a 100 kB file is very slow, so I guess either your connection or the server is very slow. If FTP is not the usual way to reach this machine, maybe the FTP server on your mainframe is shut down whenever a connection is terminated and started again when one is established? In that case ftplib should help. Still, even an overhead of half a second per file adds up to 400,000 seconds of waiting over 800,000 files, so downloading in parallel makes sense.
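To make these numbers concrete, here is a small back-of-the-envelope calculation. The figures come from the question (15 files in 7 minutes, 800,000 files) and from the half-second overhead assumed above. The helper `total_days` is just an illustrative name, and the factor for parallel connections assumes ideal scaling, which a real server will not quite deliver.

```python
FILES = 800000
SECONDS_PER_DAY = 86400.0

# Observed rate from the question: 15 files in 7 minutes.
observed_per_file = 7 * 60 / 15.0  # 28.0 seconds per file


def total_days(seconds_per_file, n_files=FILES, parallel=1):
    """Total wall-clock days, assuming perfect scaling over `parallel` links."""
    return seconds_per_file * n_files / parallel / SECONDS_PER_DAY


print(total_days(observed_per_file))               # about 259 days
print(total_days(0.5))                             # about 4.6 days of pure overhead
print(total_days(observed_per_file, parallel=10))  # about 26 days
```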
You might first try an FTP client such as FileZilla to check what bandwidth is achievable.
Upvotes: 1