Reputation: 3482
I need to download over 30k pages on Linux, and I imagined I could do that with a simple bash script + wget. Here is what I came up with:
#!/bin/bash
start_time=$(date +%s)
for i in {1..30802}
do
    echo "Downloading page http://www.domain.com/page:$i"
    wget "http://www.domain.com/page:$i" -q -o /dev/null -b -O pages/$i
    running=$(ps -ef | grep wget | wc -l)
    while [ $running -gt 1000 ]
    do
        running=$(ps -ef | grep wget | wc -l)
        echo "Current running $running process."
        sleep 1;
    done
done
while [ $running -gt 1 ]
do
    running=$(ps -ef | grep wget | wc -l)
    echo "Waiting for all the process to end..."
    sleep 10;
done
finish_time=$(date +%s)
echo "Time duration: $((finish_time - start_time)) secs."
Some pages are not being completely downloaded!
Since the above code keeps up to 1,000 wget processes running in parallel and only waits for the count to drop before adding more, could it be that I am actually saturating my internet link?
How could I make this more reliable, so that each page is guaranteed to be properly downloaded?
Upvotes: 1
Views: 2364
Reputation: 2423
Here is a possible solution to your situation:
1) Change the way you call wget to something like this:
(wget "http://www.domain.com/page:$i" -q -o /dev/null -O pages/$i || touch $i.bad) &
2) When your script finishes, search for all *.bad files and relaunch wget for each of them, deleting the corresponding .bad file before the new retry.
3) Repeat until no *.bad file exists, as sketched right below.
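A minimal sketch of such a retry loop, assuming the .bad markers live next to the script and pages are saved under pages/ as in the command above, could look like this:
#!/bin/bash
# Keep retrying as long as any .bad marker remains (illustrative sketch).
while ls *.bad >/dev/null 2>&1
do
    for bad in *.bad
    do
        i=${bad%.bad}      # recover the page number from the marker name
        rm -f "$bad"       # delete the marker before the new retry
        (wget "http://www.domain.com/page:$i" -q -o /dev/null -O "pages/$i" || touch "$i.bad") &
    done
    wait                   # let this batch finish before checking again
done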
That's the general idea. Hope that helped!
EDIT:
For the situation in which wget processes disappear, are killed, or end abruptly, there is a possible refinement:
(wget "http://www.domain.com/page:$i" -q -o /dev/null -O pages/$i || touch $i.bad && touch $i.ok) &
Then you can check whether each page was downloaded completely or whether wget failed to finish.
EDIT 2:
After some testing and digging, I've discovered that my former proposal was flawed. The order of the conditionals must be swapped:
(wget "http://www.domain.com/page:$i" -q -o /dev/null -O pages/$i && touch $i.ok || touch $i.bad) &
So:
If the download is executed correctly by wget (i.e. it finishes with a zero return code), then there must be two files: the downloaded page and the .ok file.
If the download fails (i.e. wget returns a non-zero code), then there must be the .bad file, and perhaps a partial download of the page.
In any case, only the .ok files are significant: they say that the download finished correctly from wget's point of view (I will discuss this below).
If no .ok file is found for a specific page, then surely it has not been downloaded, so it must be retried.
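For example, a quick way to list the pages that still lack an .ok marker, using the same 1..30802 range as the question, could be:
# Illustrative check: print every page number that has no .ok marker yet
for i in {1..30802}
do
    [ -f "$i.ok" ] || echo "page $i still needs a retry"
done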
Then we get to the most delicate part of your procedure: what happens if the web server, as a response to that big number of requests, cancels the requests it cannot serve by answering with an HTTP 200 response and a zero content length? That would be a plausible technique to discourage bulk copying or some kinds of server attack.
If that's the case, you must take a look at the pattern of the responses: there will be an .ok file, but the file size of the downloaded page may be zero.
You can detect those zero-length downloads with:
filesize=$(wc -c < "pages/$i")
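As an equivalent check (a small aside, not part of the procedure itself), the shell's -s test is true only for non-empty files, so something like this could flag the same cases:
# -s is true only if the file exists and has a size greater than zero
if [ ! -s "pages/$i" ]
then
    echo "Page $i is empty or missing"
fi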
And then add some logic to the former procedure of .ok and .bad files:
retry=0
if [ -f "$i.bad" ]
then
    retry=1
elif [ -f "$i.ok" ]
then
    if [ "$filesize" -eq 0 ]
    then
        retry=1
    fi
else
    retry=1
fi

if [ $retry -eq 1 ]
then
    # retry the download
fi
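Putting it together, a minimal sketch of one full retry pass could look like this; the page range, the pages/ directory and the marker names are taken from the question and the commands above:
#!/bin/bash
# One illustrative retry pass: re-download every page that has a .bad marker,
# no .ok marker, or a zero-byte result.
for i in {1..30802}
do
    filesize=0
    [ -f "pages/$i" ] && filesize=$(wc -c < "pages/$i")

    if [ -f "$i.bad" ] || [ ! -f "$i.ok" ] || [ "$filesize" -eq 0 ]
    then
        rm -f "$i.bad" "$i.ok"
        (wget "http://www.domain.com/page:$i" -q -o /dev/null -O "pages/$i" \
            && touch "$i.ok" || touch "$i.bad") &
    fi
done
wait    # let the background downloads finish before running another pass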
Hope this helped!
Upvotes: 2
Reputation: 7630
I don't know what kind of connection you have, but a high number of concurrent connections leads to packet loss. Also consider what kind of connection the server has: if it is not an in-house server, the party hosting it might think this is a denial-of-service attack and filter your IP. It is more reliable to just do it one by one. The bottleneck is almost always the internet connection; you can't do it any faster.
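If you do go one by one, a minimal sketch could simply lean on wget's own retry options (the option values here are just examples):
#!/bin/bash
# Download one page at a time and let wget retry transient failures itself.
for i in {1..30802}
do
    wget "http://www.domain.com/page:$i" -q --tries=3 --waitretry=5 --timeout=30 -O "pages/$i" \
        || echo "$i" >> failed_pages.txt    # record pages that still failed
done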
Upvotes: 0