Guapo

Reputation: 3482

wget download not completing all pages

I need to download over 30k pages on Linux and imagined I could do that with a simple bash script + wget. Here is what I came up with:

#!/bin/bash

start_time=$(date +%s)
for i in {1..30802}
do
        echo "Downloading page http://www.domain.com/page:$i"
        wget "http://www.domain.com/page:$i" -q -o /dev/null -b -O pages/$i
        running=$(ps -ef | grep wget | wc -l)
        while [ $running -gt 1000 ]
        do
                running=$(ps -ef | grep wget | wc -l)
                echo "Current running $running process."
                sleep 1;
        done
done

while [ $running -gt 1 ]
do
        running=$(ps -ef | grep wget | wc -l)
        echo "Waiting for all the process to end..."
        sleep 10;
done

finish_time=$(date +%s)
echo "Time duration: $((finish_time - start_time)) secs."

Some pages are not being completely downloaded!


Upvotes: 1

Views: 2364

Answers (2)

felixgaal

Reputation: 2423

Here is a possible solution to your situation:

1) Change the way you call wget to something like this:

(wget "http://www.domain.com/page:$i" -q -o /dev/null -O pages/$i || touch $i.bad) &

2) When your script finishes, search for all *.bad files and relaunch the wget for each of them. Delete the corresponding .bad file before the new retry.

3) Do until no *.bad file exists.

That's the general idea. Hope that helped!
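A minimal sketch of steps 2) and 3), assuming the pages are saved as pages/$i (as in your script) and the .bad markers land in the current directory:

#!/bin/bash

# Keep retrying until no .bad marker remains.
while ls *.bad >/dev/null 2>&1
do
  for f in *.bad
  do
    i=${f%.bad}                  # recover the page number from the marker name
    rm -f "$f"                   # delete the marker before the new retry
    echo "Retrying page $i"
    (wget "http://www.domain.com/page:$i" -q -o /dev/null -O "pages/$i" || touch "$i.bad") &
  done
  wait                           # let this batch of retries finish before re-checking
done

The wait keeps each retry batch from piling up; you could also cap concurrency with the same ps-based counter your original script uses.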

EDIT:

For the situation in which wget processes disappear, are killed or end abruptly, there is a possible refinement:

(wget "http://www.domain.com/page:$i" -q -o /dev/null -O pages/$i || touch $i.bad && touch $i.ok) &

Then you can check whether each page was downloaded completely or whether wget failed to finish.

EDIT 2:

After some testing and digging, I've discovered that my former proposal was flawed. The order of the conditionals must be swapped:

(wget "http://www.domain.com/page:$i" -q -o /dev/null -O pages/$i && touch $i.ok || touch $i.bad) &

So,

  • If the download completes correctly (i.e. wget exits with a zero return code), then there must be two files: the downloaded page and the .ok file.

  • If the download fails (i.e. wget exits with a non-zero return code), then there must be a .bad file, and perhaps a partial download of the page.

In any case, only the .ok files are significant: they indicate that the download finished correctly (from wget's point of view; I will discuss this below).

If no .ok file is found for a specific page, then surely it has not been downloaded, so it must be retried.
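For example, a quick way to list the pages that still lack a .ok marker (a sketch; the range is taken from your question, and pages_to_retry.txt is just an arbitrary name):

for i in {1..30802}
do
  [ -f "$i.ok" ] || echo "$i"    # no .ok marker: page $i must be retried
done > pages_to_retry.txt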

Then we get to the most delicate part of your procedure: what happens if the web server, in response to that large number of requests, answers the ones it cannot serve with an HTTP 200 response and a zero content length? That would be a reasonable technique for the server to discourage site copying or what it perceives as an attack.

If that's the case, you must look at the pattern of the responses: there will be a .ok file, but the downloaded page may have a size of zero.

You can detect those zero-length downloads with:

filesize=$(wc -c < "pages/$i")

And then add some logic to the former procedure of .ok and .bad files:

retry=0
if [ -f "$i.bad" ]
then
  # wget reported a failure
  retry=1
elif [ -f "$i.ok" ]
then
  # wget reported success, but the body may still be empty
  if [ "$filesize" -eq 0 ]
  then
    retry=1
  fi
else
  # neither marker exists: wget never finished
  retry=1
fi

if [ $retry -eq 1 ]
then
  :  # retry the download here
fi
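Putting the markers and the size check together, one pass of the whole retry procedure might look like this (a sketch, assuming the downloads are stored as pages/$i as in your script; repeat the pass until nothing is left to retry):

#!/bin/bash

for i in {1..30802}
do
  retry=0
  if [ -f "$i.bad" ]
  then
    retry=1                            # wget reported a failure
  elif [ -f "$i.ok" ]
  then
    filesize=$(wc -c < "pages/$i")     # size of the downloaded page
    if [ "$filesize" -eq 0 ]
    then
      retry=1                          # "successful" download but empty body
    fi
  else
    retry=1                            # no marker at all: wget never finished
  fi

  if [ $retry -eq 1 ]
  then
    rm -f "$i.bad" "$i.ok"
    (wget "http://www.domain.com/page:$i" -q -o /dev/null -O "pages/$i" && touch "$i.ok" || touch "$i.bad") &
  fi
done
wait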

Hope this helped!

Upvotes: 2

pizza

Reputation: 7630

I don't know what kind of connection you have, but a high number of concurrent connections leads to packet loss. Also consider what kind of connection the server has. If this is not an in-house server, the party that hosts it might think this is a denial-of-service attack and filter your IP. It is more reliable to just do it one by one. The bottleneck is almost always the internet connection; you can't do it any faster.
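If you do go one by one, a minimal sketch keeping the same file layout as the question could look like this (failed.txt is just an arbitrary name for recording the pages that still fail):

#!/bin/bash

# Sequential download: one request at a time.
for i in {1..30802}
do
  wget "http://www.domain.com/page:$i" -q -O "pages/$i" || echo "$i" >> failed.txt
done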

Upvotes: 0
