Reputation: 147
I am new to bash.
I want to wget some resources in parallel.
What is the problem with the following code?
for item in $list
do
    if [ $i -le 10 ]; then
        wget -b $item
        let "i++"
    else
        wait
        i=1
    fi
done
When I execute this script, the following error is thrown:
fork: Resource temporarily unavailable
My question is how to use wget the right way.
My problem is that there are about four thousand URLs to download. If I let all of these jobs run in parallel, fork: Resource temporarily unavailable is thrown. I don't know how to control how many run in parallel.
Upvotes: 2
Views: 4532
Reputation: 1060
Use jobs|grep to check background jobs:
#!/bin/bash
urls=('www.cnn.com' 'www.wikipedia.org')        ## input data
for ((i = 0; i < ${#urls[@]}; i++)); do
    curl -L -s "${urls[$i]}" >"file-$i.html" &  ## background jobs
done
until [[ -z $(jobs | grep -E -v 'Done|Terminated') ]]; do
    sleep 0.05; echo -n '.'                     ## do something while waiting
done
echo; ls -l file*.html                          ## list downloaded files
Results:
............................
-rw-r--r-- 1 xxx xxx 155421 Jan 20 00:50 file-0.html
-rw-r--r-- 1 xxx xxx 74711 Jan 20 00:50 file-1.html
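If you do not need to do anything while waiting, a simpler variant of the same idea (just a sketch, reusing the urls array from above) is to let the wait builtin block until every background job has finished:
#!/bin/bash
urls=('www.cnn.com' 'www.wikipedia.org')        ## input data
for ((i = 0; i < ${#urls[@]}; i++)); do
    curl -L -s "${urls[$i]}" >"file-$i.html" &  ## background jobs
done
wait                                            ## block until all background jobs finish
ls -l file*.html                                ## list downloaded files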
Another variant, with the tasks themselves run in parallel:
#!/bin/bash
urls=('www.yahoo.com' 'www.hotmail.com' 'stackoverflow.com')
_task1(){                                       ## task 1: download files
    for ((i = 0; i < ${#urls[@]}; i++)); do
        curl -L -s "${urls[$i]}" >"file-$i.html" &
    done; wait
}
_task2(){ echo hello; } ## task 2: a fake task
_task3(){ echo hi; } ## task 3: a fake task
_task1 & _task2 & _task3 & ## run them in parallel
wait ## and wait for them
ls -l file*.html ## list results of all tasks
echo done ## and do something
Results:
hello
hi
-rw-r--r-- 1 xxx xxx 320013 Jan 20 02:19 file-0.html
-rw-r--r-- 1 xxx xxx 3566 Jan 20 02:19 file-1.html
-rw-r--r-- 1 xxx xxx 253348 Jan 20 02:19 file-2.html
done
An example that limits how many downloads run in parallel at a time (max=3):
#!/bin/bash
m=3 ## max jobs (downloads) at a time
t=4 ## retries for each download
_debug(){ ## list jobs to see (debug)
printf ":: jobs running: %s\n" "$(echo `jobs -p`)"
}
## sample input data
## is redirected to filehandle=3
exec 3<<-EOF
www.google.com google.html
www.hotmail.com hotmail.html
www.wikipedia.org wiki.html
www.cisco.com cisco.html
www.cnn.com cnn.html
www.yahoo.com yahoo.html
EOF
## read data from filehandle=3, line by line
while IFS=' ' read -u 3 -r u f || [[ -n "$f" ]]; do
[[ -z "$f" ]] && continue ## ignore empty input line
while [[ $(jobs -p|wc -l) -ge "$m" ]]; do ## while $m or more jobs are running
_debug ## then list jobs to see (debug)
wait -n ## and wait for some job(s) to finish
done
curl --retry $t -Ls "$u" >"$f" & ## download in background
printf "job %d: %s => %s\n" $! "$u" "$f" ## print job info to see (debug)
done
_debug; wait; ls -l *.html ## see final results
Outputs:
job 22992: www.google.com => google.html
job 22996: www.hotmail.com => hotmail.html
job 23000: www.wikipedia.org => wiki.html
:: jobs running: 22992 22996 23000
job 23022: www.cisco.com => cisco.html
:: jobs running: 22996 23000 23022
job 23034: www.cnn.com => cnn.html
:: jobs running: 23000 23022 23034
job 23052: www.yahoo.com => yahoo.html
:: jobs running: 23000 23034 23052
-rw-r--r-- 1 xxx xxx 61473 Jan 21 01:15 cisco.html
-rw-r--r-- 1 xxx xxx 155055 Jan 21 01:15 cnn.html
-rw-r--r-- 1 xxx xxx 12514 Jan 21 01:15 google.html
-rw-r--r-- 1 xxx xxx 3566 Jan 21 01:15 hotmail.html
-rw-r--r-- 1 xxx xxx 74711 Jan 21 01:15 wiki.html
-rw-r--r-- 1 xxx xxx 319967 Jan 21 01:15 yahoo.html
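Note that wait -n requires bash 4.3 or newer. On an older bash, a coarser fallback (just a sketch, reusing the filehandle-3 input and the $m/$t settings from the script above) is to start the downloads in batches of $m and wait for each whole batch to finish before starting the next one:
n=0
while IFS=' ' read -u 3 -r u f || [[ -n "$f" ]]; do
    [[ -z "$f" ]] && continue             ## ignore empty input line
    curl --retry $t -Ls "$u" >"$f" &      ## download in background
    (( ++n % m == 0 )) && wait            ## after every $m jobs, wait for the whole batch
done
wait                                      ## wait for the last (partial) batch
ls -l *.html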
After reading your updated question, I think it is much easier to use lftp, which can log and download (automatically follow links, retry downloads, and continue interrupted downloads); you'll never need to worry about job/fork resources because you run only a few lftp commands. Just split your download list into some smaller lists, and lftp will download them for you:
$ cat downthemall.sh
#!/bin/bash
## run: lftp -c 'help get'
## to know how to use lftp to download files
## with automatically retry+continue
p=() ## pid list
for l in *\.lst; do
lftp -f "$l" >/dev/null & ## run processes in parallel
p+=("--pid=$!") ## record pid
done
until [[ -f d.log ]]; do sleep 0.5; done ## wait for the log file
tail -f d.log "${p[@]}" ## print results while downloading
Outputs:
$ cat 1.lst
set xfer:log true
set xfer:log-file d.log
get -c http://www.microsoft.com -o micro.html
get -c http://www.cisco.com -o cisco.html
get -c http://www.wikipedia.org -o wiki.html
$ cat 2.lst
set xfer:log true
set xfer:log-file d.log
get -c http://www.google.com -o google.html
get -c http://www.cnn.com -o cnn.html
get -c http://www.yahoo.com -o yahoo.html
$ cat 3.lst
set xfer:log true
set xfer:log-file d.log
get -c http://www.hp.com -o hp.html
get -c http://www.ibm.com -o ibm.html
get -c http://stackoverflow.com -o stack.html
$ rm *log *html;./downthemall.sh
2018-01-22 02:10:13 http://www.google.com.vn/?gfe_rd=cr&dcr=0&ei=leVkWqiOKfLs8AeBvqBA -> /tmp/1/google.html 0-12538 103.1 KiB/s
2018-01-22 02:10:13 http://edition.cnn.com/ -> /tmp/1/cnn.html 0-153601 362.6 KiB/s
2018-01-22 02:10:13 https://www.microsoft.com/vi-vn/ -> /tmp/1/micro.html 0-129791 204.0 KiB/s
2018-01-22 02:10:14 https://www.cisco.com/ -> /tmp/1/cisco.html 0-61473 328.0 KiB/s
2018-01-22 02:10:14 http://www8.hp.com/vn/en/home.html -> /tmp/1/hp.html 0-73136 92.2 KiB/s
2018-01-22 02:10:14 https://www.ibm.com/us-en/ -> /tmp/1/ibm.html 0-32700 131.4 KiB/s
2018-01-22 02:10:15 https://vn.yahoo.com/?p=us -> /tmp/1/yahoo.html 0-318657 208.4 KiB/s
2018-01-22 02:10:15 https://www.wikipedia.org/ -> /tmp/1/wiki.html 0-74711 60.7 KiB/s
2018-01-22 02:10:16 https://stackoverflow.com/ -> /tmp/1/stack.html 0-253033 180.8
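If your input is a flat file of URLs (as in the question), the *.lst files do not have to be written by hand. A rough sketch of generating them with GNU split (the input name urls.txt, the 500-URLs-per-list size, and the derived output file names are only illustrative assumptions):
#!/bin/bash
split -l 500 -d urls.txt chunk-           ## chunk-00, chunk-01, ... with 500 URLs each
n=1
for c in chunk-*; do
    {
        echo 'set xfer:log true'
        echo 'set xfer:log-file d.log'
        while read -r u; do
            echo "get -c $u -o $(basename "$u").html"  ## output name is only illustrative
        done <"$c"
    } >"$n.lst"                           ## 1.lst, 2.lst, ...
    ((n++))
done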
Upvotes: 4
Reputation: 11469
With the updated question, here is an updated answer.
The following script launches 10 (can be changed to any number) wget processes in the background and monitors them. Once one of the processes finishes, it picks the next URL from the list and tries to keep $maxn (10) processes running in the background until it runs out of URLs in the list ($urlfile). There are inline comments to help understand it.
$ cat wget.sh
#!/bin/bash
wget_bg()
{
> ./wget.pids # Start with empty pidfile
urlfile="$1"
maxn=$2
cnt=0;
while read -r url
do
if [ $cnt -lt $maxn ] && [ ! -z "$url" ]; then # Only maxn processes will run in the background
echo -n "wget $url ..."
wget "$url" &>/dev/null &
pidwget=$! # This gets the backgrounded pid
echo "$pidwget" >> ./wget.pids # fill pidfile
echo "pid[$pidwget]"
((cnt++));
fi
while [ $cnt -eq $maxn ] # Start monitoring as soon the maxn process hits
do
while read -r pids
do
if ps -p $pids > /dev/null; then # Check pid running
:
else
sed -i "/^$pids$/d" wget.pids # If not, remove it from the pidfile (anchored so e.g. 123 does not match 1234)
((cnt--)); # decrement counter
fi
done < wget.pids
done
done < "$urlfile"
}
# This runs 10 wget processes at a time in the bg. Modify for more or less.
wget_bg ./test.txt 10
To run:
$ chmod u+x ./wget.sh
$ ./wget.sh
wget blah.com ...pid[13012]
wget whatever.com ...pid[13013]
wget thing.com ...pid[13014]
wget foo.com ...pid[13015]
wget bar.com ...pid[13016]
wget baz.com ...pid[13017]
wget steve.com ...pid[13018]
wget kendal.com ...pid[13019]
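For comparison, a much shorter way to get roughly the same effect (a sketch, assuming test.txt holds one URL per line, with no spaces or quotes in them) is to let xargs keep 10 wget processes running at a time:
xargs -n 1 -P 10 wget -q < ./test.txt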
Upvotes: 2
Reputation: 2698
Add this to your if statement:
until wget -b "$item"; do
    printf '.'
    sleep 2
done
The loop will wait for the process to finish, printing a "." every 2 seconds.
Upvotes: -2