Guapo

Reputation: 3482

wget download not completing all pages

I need to download over 30k pages on Linux and imagined I could do that with a simple bash script + wget. Here is what I came up with:

#!/bin/bash

start_time=$(date +%s)
for i in {1..30802}
do
        echo "Downloading page http://www.domain.com/page:$i"
        wget "http://www.domain.com/page:$i" -q -o /dev/null -b -O pages/$i
        running=$(ps -ef | grep wget | wc -l)
        while [ $running -gt 1000 ]
        do
                running=$(ps -ef | grep wget | wc -l)
                echo "Current running $running process."
                sleep 1;
        done
done

while [ $running -gt 1 ]
do
        running=$(ps -ef | grep wget | wc -l)
        echo "Waiting for all the process to end..."
        sleep 10;
done

finish_time=$(date +%s)
echo "Time duration: $((finish_time - start_time)) secs."

Some pages are not being completely downloaded!


Upvotes: 1

Views: 2364

Answers (2)

felixgaal

Reputation: 2423

Here is a possible solution to your situation:

1) Change the way you call wget to something like this:

(wget "http://www.domain.com/page:$i" -q -o /dev/null -O pages/$i || touch $i.bad) &

2) When your script finishes, search for all *.bad files and relaunch the wget for each of them. Delete the corresponding .bad file before the new retry.

3) Do until no *.bad file exists.

That's the general idea. Hope that helped!
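A minimal sketch of steps 2) and 3), assuming the pages are saved as pages/$i (as in your script) and the .bad markers land in the current directory:

#!/bin/bash

# Keep retrying until no .bad marker remains.
while ls *.bad >/dev/null 2>&1
do
  for f in *.bad
  do
    i=${f%.bad}                  # recover the page number from the marker name
    rm -f "$f"                   # delete the marker before the new retry
    echo "Retrying page $i"
    (wget "http://www.domain.com/page:$i" -q -o /dev/null -O "pages/$i" || touch "$i.bad") &
  done
  wait                           # let this batch of retries finish before re-checking
done

The wait keeps each retry batch from piling up; you could also cap concurrency with the same ps-based counter your original script uses.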

EDIT:

For the situation in which wget processes disappear, are killed or end abruptly, there is a possible refinement:

(wget "http://www.domain.com/page:$i" -q -o /dev/null -O pages/$i || touch $i.bad && touch $i.ok) &

Then you can check whether each page was downloaded completely or whether wget failed to finish.

EDIT 2:

After some testing and digging, I've discovered that my former proposal was flawed. The order of the conditionals must be swapped:

(wget "http://www.domain.com/page:$i" -q -o /dev/null -O pages/$i && touch $i.ok || touch $i.bad) &

So,

  • If the download completes correctly (i.e. wget exits with a zero return code), then there must be two files: the downloaded page and the .ok file.

  • If the download fails (i.e. wget exits with a non-zero return code), then there must be a .bad file, and perhaps a partial download of the page.

In any case, only the .ok files are significant: they indicate that the download finished correctly (from wget's point of view; I will discuss this below).

If no .ok file is found for a specific page, then surely it has not been downloaded, so it must be retried.
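For example, a quick way to list the pages that still lack a .ok marker (a sketch; the range is taken from your question, and pages_to_retry.txt is just an arbitrary name):

for i in {1..30802}
do
  [ -f "$i.ok" ] || echo "$i"    # no .ok marker: page $i must be retried
done > pages_to_retry.txt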

Then we get to the most delicate part of your procedure: what happens if the web server, in response to that large number of requests, answers the ones it cannot serve with an HTTP 200 response and a zero content length? That would be a reasonable technique for the server to discourage site copying or what it perceives as an attack.

If that's the case, you must look at the pattern of the responses: there will be a .ok file, but the downloaded page may have a size of zero.

You can detect those zero-length downloads with:

filesize=$(wc -c < "pages/$i")

And then add some logic to the former procedure of .ok and .bad files:

retry=0
if [ -f "$i.bad" ]
then
  # wget reported a failure
  retry=1
elif [ -f "$i.ok" ]
then
  # wget reported success, but the body may still be empty
  if [ "$filesize" -eq 0 ]
  then
    retry=1
  fi
else
  # neither marker exists: wget never finished
  retry=1
fi

if [ $retry -eq 1 ]
then
  :  # retry the download here
fi
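Putting the markers and the size check together, one pass of the whole retry procedure might look like this (a sketch, assuming the downloads are stored as pages/$i as in your script; repeat the pass until nothing is left to retry):

#!/bin/bash

for i in {1..30802}
do
  retry=0
  if [ -f "$i.bad" ]
  then
    retry=1                            # wget reported a failure
  elif [ -f "$i.ok" ]
  then
    filesize=$(wc -c < "pages/$i")     # size of the downloaded page
    if [ "$filesize" -eq 0 ]
    then
      retry=1                          # "successful" download but empty body
    fi
  else
    retry=1                            # no marker at all: wget never finished
  fi

  if [ $retry -eq 1 ]
  then
    rm -f "$i.bad" "$i.ok"
    (wget "http://www.domain.com/page:$i" -q -o /dev/null -O "pages/$i" && touch "$i.ok" || touch "$i.bad") &
  fi
done
wait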

Hope this helped!

Upvotes: 2

pizza

Reputation: 7630

I don't know what kind of connection you have, but a high number of concurrent connections leads to packet loss. Also consider what kind of connection the server has. If this is not an in-house server, the party that hosts it might think this is a denial-of-service attack and filter your IP. It is more reliable to just do it one by one. The bottleneck is almost always the internet connection; you can't do it any faster.
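If you do go one by one, a minimal sketch keeping the same file layout as the question could look like this (failed.txt is just an arbitrary name for recording the pages that still fail):

#!/bin/bash

# Sequential download: one request at a time.
for i in {1..30802}
do
  wget "http://www.domain.com/page:$i" -q -O "pages/$i" || echo "$i" >> failed.txt
done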

Upvotes: 0
