user7568140

Reputation: 33

Download lots of small files

I need an efficient way to download lots (millions) of small files from a list of URLs in a text file. I want the files to be saved with new names (from another text file or wherever), since the URLs are long, dynamically generated gibberish and would cause problems with maximum file name lengths, etc.

I first tried wget, but was limited by the fact that you can either specify a list of URLs from a text file, e.g.:

wget.exe -i myURLlist.txt

or rename a single downloaded file with a new name, e.g.:

wget.exe -O myfilename1.jpg http://www.foo.com/longgibberish976876....2131.jpg

but not both. Therefore my script had to execute wget individually (using the second method) for each file. This is incredibly slow due to restarting the TCP connection each time and other overhead (if you pass a list of URLs in a text file, wget attempts to re-use the connection, but then I can't specify the file names).

I then tried curl, which lets you pass multiple URLs and file names via command-line arguments, e.g.:

curl.exe
-o myfilename1.jpg http://www.foo.com/longgibberish976876....2131.jpg
-o myfilename2.jpg http://www.foo.com/longgibberish324....32432.jpg
-o .....

This was a speed improvement, since curl would attempt to re-use the same connection for all the URLs passed to it. However, I was limited to batches of about 20 URLs before it started to skip files. I didn't confirm why this happened, but suspect the maximum command-line length might have been exceeded. In any case, this certainly would not scale to a million or so files. I haven't found an option to pass a text file to curl in the same way as you can with wget.

What options are left? Is there some syntax for the two programs I've already tried that I'm not aware of, or do I need some other tool?

Upvotes: 3

Views: 1392

Answers (3)

MC ND

Reputation: 70941

With curl you only need a file with the format

output = filename1.jpg
url = http://....
output = filename2.jpg
url = http://....

and use the -K file switch to process it, or dynamically generate the list and read it from standard input with -K -.
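
For example, assuming the pairs above are saved in a file called urls.cfg (a name used here just for illustration), a single curl invocation can download all of them:

curl -K urls.cfg

curl pairs each output entry with the url line that follows it, just as repeated -o arguments are paired with URLs on the command line.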

So, from a URL list, you can try this code:

@echo off
    setlocal enableextensions disabledelayedexpansion

    rem Number the downloads so each one gets a unique output file name
    set "count=0"

    rem For every line (URL) in urlList.txt, emit an "output = fileN.jpg" /
    rem "url = ..." pair and pipe the generated configuration to curl via -K -
    (for /f "usebackq delims=" %%a in ("urlList.txt") do @(
        >nul set /a "count+=1"
        call echo(output = file%%^^count%%.jpg
        echo(url = %%a
    )) | curl -K -

Or, for really big URL lists (for /f needs to load the full file into memory), you can use

@echo off
    setlocal enableextensions disabledelayedexpansion

    rem Read urlList.txt line by line inside a child cmd instance (with
    rem delayed expansion enabled) so the full file is never loaded into
    rem memory, emit the output/url pairs and pipe them to curl via -K -
    < urlList.txt (
        cmd /e /v /q /c"for /l %%a in (1 1 2147483647) do set /p.=&&(echo(output = file%%a.jpg&echo(url = !.!)||exit"
    ) | curl -K -

notes:

  1. As arithmetic operations in batch files are limited to values lower than 2^31, these samples will fail if your lists contain more than 2147483647 URLs.

  2. The first sample will fail with URLs longer than approx. 8180 characters.

  3. The second sample will fail with URLs longer than 1021 characters and will terminate on empty lines in the source file.

Upvotes: 0

zb226

Reputation: 10529

You can use the aria2 download utility with:

  • the -j [NUMBER] option for concurrent downloads
  • the -i [FILENAME] option to provide the URLs and output file names in a text file

For example, assume files.txt contains:

http://rakudo.org/downloads/star/rakudo-star-2017.01.tar.gz
    out=test1.file
http://rakudo.org/downloads/star/rakudo-star-2017.01.dmg
    out=test2.file
http://rakudo.org/downloads/star/rakudo-star-2017.01-x86_64%20(JIT).msi
    out=test3.file
http://rakudo.org/downloads/star/rakudo-star-2016.11.tar.gz
    out=test4.file

Then you would just run, e.g., aria2c -j4 -i files.txt to download all those files in parallel. Not sure how this performs with millions of small files, but I guess it's worth a shot.

Upvotes: 1

Mark Setchell

Reputation: 207818

It is the latency that will do you in. In a normal, sequential process, if there is a latency of 1-3 seconds per file, you will pay them all, one after the other, and spend 1-3 million seconds downloading a million files.

The trick is to pay the latencies in parallel: put out, say, 64 parallel requests and wait 1-3 seconds for them all to return, instead of the roughly 180 seconds it would take to do those 64 sequentially.
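
To put rough numbers on it (taking the worst case of 3 seconds of latency per file and 64-way parallelism, purely to illustrate the scale):

sequential: 1,000,000 files × 3 s = 3,000,000 s, i.e. roughly 35 days
parallel:   1,000,000 files / 64 × 3 s ≈ 47,000 s, i.e. roughly 13 hours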

I would commend GNU Parallel to you, which, although of Unix origin, runs under Cygwin. Please look up some tutorials.

The command to do 64 curls at a time will be something like this:

parallel -j 64 -a filelist.txt curl {}
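
Since you also want each file saved under a new name, one possible variation (only a sketch, assuming a hypothetical tab-separated file pairs.txt with the URL in the first column and the desired file name in the second) would be:

parallel -j 64 --colsep '\t' -a pairs.txt curl -o {2} {1}

Here --colsep splits each input line into columns, and {1}/{2} are GNU Parallel's positional replacement strings, so every curl invocation receives both its URL and its output name.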

Upvotes: 1
