Reputation: 2672
I have a use case where a large remote file needs to be downloaded in parts, using multiple threads. Each thread must run simultaneously (in parallel), grabbing a specific part of the file. The expectation is to combine the parts into a single (original) file once all parts have been successfully downloaded.
Perhaps the requests library could do the job, but then I am not sure how I would multithread this into a solution that combines the chunks.
from requests import get

url = 'https://url.com/file.iso'
headers = {"Range": "bytes=0-1000000"}  # first megabyte
r = get(url, headers=headers)
I was also thinking of using curl, where Python would orchestrate the downloads, but I am not sure that's the correct way to go. It just seems too complex and strays away from a vanilla Python solution. Something like this:
curl --range 200000000-399999999 -o file.iso.part2
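In Python I imagine orchestrating those curl calls roughly like this (untested sketch; the total size and number of parts are hard-coded placeholders):
import subprocess
from concurrent.futures import ThreadPoolExecutor

url = 'https://url.com/file.iso'
total_size = 800000000  # placeholder: would really come from a HEAD request
parts = 4
part_size = total_size // parts

def fetch_part(i):
    # Last part runs to the end of the file; Range end values are inclusive.
    start = i * part_size
    end = total_size - 1 if i == parts - 1 else (i + 1) * part_size - 1
    subprocess.run(['curl', '--silent', '--range', f'{start}-{end}',
                    '-o', f'file.iso.part{i}', url], check=True)

with ThreadPoolExecutor(max_workers=parts) as pool:
    list(pool.map(fetch_part, range(parts)))

# ...then concatenate file.iso.part0..part3 back into file.iso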
Can someone explain how you'd go about something like this? Or post a code example of something that works in Python 3? I usually find the Python-related answers quite easily, but the solution to this problem seems to be eluding me.
Upvotes: 3
Views: 13917
Reputation: 130
The best way I found is to use a module called pySmartDL.
Edit: This module has some issues: there is no way to pause the download and resume it later, and the project isn't actively maintained anymore.
So if you are looking for such features, I would suggest you try pypdl instead. Be aware that it doesn't have some of the advanced features that pySmartDL offers, though for most folks pypdl would be the better choice.
pypdl can pause/resume downloads
pypdl can retry the download in case of failure, with an option to continue downloading using a different URL if necessary
and many more ...
How to install pypdl
step 1: pip install pypdl
step 2: to download the file you could use
from pypdl import Downloader
dl = Downloader()
dl.start('http://example.com/file.txt')
Note: This gives you a download meter and downloads the file to the current working directory by default.
In case you need to hook the download progress into a GUI or want to give a specific path, you could use
dl = Downloader()
dl.start('http://example.com/file.txt', 'downloads/', block=False, display=False)
while dl.progress != 100:
    print(dl.progress)
If you want to use more threads or give a specific file name, you can use
dl = Downloader()
dl.start('http://example.com/file.txt', 'downloads/file.txt', num_connections=8)
You can find many more features on the project page: https://pypi.org/project/pypdl/
Upvotes: 3
Reputation: 2438
You can also use ThreadPoolExecutor (or ProcessPoolExecutor) from concurrent.futures instead of using asyncio. The following shows how to modify bug's answer by using ThreadPoolExecutor:
Bonus: The following snippet also uses tqdm to show a progress bar of the download. If you don't want to use tqdm then just comment out the block below with tqdm(total=file_size . . . . More information on tqdm is here which can be installed with pip install tqdm. Btw, tqdm can also be used with asyncio.
import requests
import concurrent.futures
from concurrent.futures import as_completed
from tqdm import tqdm
import os

def download_part(url_and_headers_and_partfile):
    url, headers, partfile = url_and_headers_and_partfile
    response = requests.get(url, headers=headers)
    # setting same as below in the main block, but not necessary:
    chunk_size = 1024*1024
    # Need size to make tqdm work.
    size = 0
    with open(partfile, 'wb') as f:
        for chunk in response.iter_content(chunk_size):
            if chunk:
                size += f.write(chunk)
    return size

def make_headers(start, chunk_size):
    end = start + chunk_size - 1
    return {'Range': f'bytes={start}-{end}'}

url = 'https://download.samplelib.com/mp4/sample-30s.mp4'
file_name = 'video.mp4'

response = requests.get(url, stream=True)
file_size = int(response.headers.get('content-length', 0))
chunk_size = 1024*1024
chunks = range(0, file_size, chunk_size)
my_iter = [[url, make_headers(chunk, chunk_size), f'{file_name}.part{i}'] for i, chunk in enumerate(chunks)]

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    jobs = [executor.submit(download_part, i) for i in my_iter]
    with tqdm(total=file_size, unit='iB', unit_scale=True, unit_divisor=chunk_size, leave=True, colour='cyan') as bar:
        for job in as_completed(jobs):
            size = job.result()
            bar.update(size)

with open(file_name, 'wb') as outfile:
    for i in range(len(chunks)):
        chunk_path = f'{file_name}.part{i}'
        with open(chunk_path, 'rb') as s:
            outfile.write(s.read())
        os.remove(chunk_path)
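Since ProcessPoolExecutor is mentioned above as an alternative, here is a rough, untested sketch of the swap: download_part and its arguments are plain strings and dicts (so picklable), and only the executor line changes.
import concurrent.futures

# Sketch: same submit/as_completed pattern as above, but each part is fetched
# in a separate process instead of a thread. On Windows/macOS this must run
# under an `if __name__ == '__main__':` guard.
with concurrent.futures.ProcessPoolExecutor(max_workers=10) as executor:
    jobs = [executor.submit(download_part, i) for i in my_iter]
For a network-bound download, threads are normally sufficient; processes mainly pay off if each chunk also needs heavy post-processing.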
Upvotes: 2
Reputation: 4140
Here is a version using Python 3 with asyncio. It's just an example and can be improved, but it should give you everything you need.
get_size: Send a HEAD request to get the size of the file
download_range: Download a single chunk
download: Download all the chunks and merge them
import asyncio
import concurrent.futures
import functools
import requests
import os

# WARNING:
# Here I'm pointing to a publicly available sample video.
# If you are planning on running this code, make sure the
# video is still available as it might change location or get deleted.
# If necessary, replace it with a URL you know is working.
URL = 'https://download.samplelib.com/mp4/sample-30s.mp4'
OUTPUT = 'video.mp4'

async def get_size(url):
    response = requests.head(url)
    size = int(response.headers['Content-Length'])
    return size

def download_range(url, start, end, output):
    headers = {'Range': f'bytes={start}-{end}'}
    response = requests.get(url, headers=headers)
    with open(output, 'wb') as f:
        for part in response.iter_content(1024):
            f.write(part)

async def download(run, loop, url, output, chunk_size=1000000):
    file_size = await get_size(url)
    chunks = range(0, file_size, chunk_size)
    tasks = [
        run(
            download_range,
            url,
            start,
            start + chunk_size - 1,
            f'{output}.part{i}',
        )
        for i, start in enumerate(chunks)
    ]
    await asyncio.wait(tasks)
    with open(output, 'wb') as o:
        for i in range(len(chunks)):
            chunk_path = f'{output}.part{i}'
            with open(chunk_path, 'rb') as s:
                o.write(s.read())
            os.remove(chunk_path)

if __name__ == '__main__':
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=3)
    loop = asyncio.new_event_loop()
    run = functools.partial(loop.run_in_executor, executor)
    asyncio.set_event_loop(loop)
    try:
        loop.run_until_complete(
            download(run, loop, URL, OUTPUT)
        )
    finally:
        loop.close()
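As a side note, on Python 3.9+ the manual event-loop and executor setup can be replaced with asyncio.run and asyncio.to_thread. A minimal, untested sketch that reuses download_range from the code above (download_modern is a hypothetical name, and the merge loop stays the same as in download()):
import asyncio
import requests

async def download_modern(url, output, chunk_size=1000000):
    # Same idea as get_size()/download() above, without managing the loop by hand.
    file_size = int(requests.head(url).headers['Content-Length'])
    chunks = range(0, file_size, chunk_size)
    # asyncio.to_thread (Python 3.9+) runs the blocking download_range in a worker thread.
    await asyncio.gather(*(
        asyncio.to_thread(download_range, url, start, start + chunk_size - 1,
                          f'{output}.part{i}')
        for i, start in enumerate(chunks)
    ))
    # ...then merge the .part files exactly as in download() above.

# asyncio.run(download_modern(URL, OUTPUT))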
Upvotes: 13
Reputation: 18106
You could use grequests to download in parallel.
import grequests

URL = 'https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-10.1.0-amd64-netinst.iso'
CHUNK_SIZE = 104857600  # 100 MB
HEADERS = []

_start, _stop = 0, 0
for x in range(4):  # file size is > 300MB, so we download in 4 parts.
    _start = _stop
    _stop = CHUNK_SIZE * (x + 1)
    # Range headers are inclusive, so end at _stop - 1 to avoid overlapping parts.
    HEADERS.append({"Range": "bytes=%s-%s" % (_start, _stop - 1)})

rs = (grequests.get(URL, headers=h) for h in HEADERS)
downloads = grequests.map(rs)

with open('/tmp/debian-10.1.0-amd64-netinst.iso', 'ab') as f:
    for download in downloads:
        print(download.status_code)
        f.write(download.content)
PS: I did not check whether the Ranges are correctly determined or whether the downloaded md5sum matches! This should just show in general how it could work.
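If you do want to verify the result, a small hashlib sketch (the expected checksum is a placeholder to be taken from the mirror's MD5SUMS file):
import hashlib

expected_md5 = '<md5 from the mirror MD5SUMS file>'  # placeholder, not a real checksum

md5 = hashlib.md5()
with open('/tmp/debian-10.1.0-amd64-netinst.iso', 'rb') as f:
    for block in iter(lambda: f.read(1024 * 1024), b''):
        md5.update(block)

print(md5.hexdigest() == expected_md5)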
Upvotes: 1