jasonxia23

Reputation: 147

How to wait for wget to finish before getting more resources

I am new to bash.

I want to wget some resources in parallel.

What is the problem with the following code:

for item in $list
do
  if [ $i -le 10 ];then
    wget -b $item
    let "i++"
  else
    wait
    i=1
  fi
done

When I execute this script, the following error is thrown:

fork: Resource temporarily unavailable

My question is how to use wget the right way.

Edit:

My problem is that there are about four thousand URLs to download. If I let all of these jobs run in parallel, fork: Resource temporarily unavailable is thrown. I don't know how to limit the number of jobs running in parallel.

Upvotes: 2

Views: 4532

Answers (3)

Bach Lien

Reputation: 1060

Use jobs|grep to check background jobs:

#!/bin/bash

urls=('www.cnn.com' 'www.wikipedia.org')  ## input data

for ((i=-1;++i<${#urls[@]};)); do
  curl -L -s ${urls[$i]} >file-$i.html &  ## background jobs
done

until [[ -z `jobs|grep -E -v 'Done|Terminated'` ]]; do
  sleep 0.05; echo -n '.'                 ## do something while waiting
done

echo; ls -l file*\.html                   ## list downloaded files

Results:

............................
-rw-r--r-- 1 xxx xxx 155421 Jan 20 00:50 file-0.html
-rw-r--r-- 1 xxx xxx  74711 Jan 20 00:50 file-1.html
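
If you don't need the progress dots, a plain wait does the same job, since wait with no arguments blocks until every background job has finished. A minimal sketch, assuming the same urls array as above:

#!/bin/bash

urls=('www.cnn.com' 'www.wikipedia.org')      ## input data

for ((i=0;i<${#urls[@]};i++)); do
  curl -L -s "${urls[$i]}" >"file-$i.html" &  ## background jobs
done
wait                                          ## block until all downloads finish

ls -l file*\.html                             ## list downloaded files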

Another variant, with tasks running in simple parallel:

#!/bin/bash

urls=('www.yahoo.com' 'www.hotmail.com' 'stackoverflow.com')

_task1(){                                  ## task 1: download files
  for ((i=-1;++i<${#urls[@]};)); do
    curl -L -s ${urls[$i]} >file-$i.html &
  done; wait
}
_task2(){ echo hello; }                    ## task 2: a fake task
_task3(){ echo hi; }                       ## task 3: a fake task

_task1 & _task2 & _task3 &                 ## run them in parallel
wait                                       ## and wait for them

ls -l file*\.html                          ## list results of all tasks
echo done                                  ## and do something

Results:

hello
hi
-rw-r--r-- 1 xxx xxx 320013 Jan 20 02:19 file-0.html
-rw-r--r-- 1 xxx xxx   3566 Jan 20 02:19 file-1.html
-rw-r--r-- 1 xxx xxx 253348 Jan 20 02:19 file-2.html
done

An example that limits how many downloads run in parallel at a time (max=3):

#!/bin/bash

m=3                                            ## max jobs (downloads) at a time
t=4                                            ## retries for each download

_debug(){                                      ## list jobs to see (debug)
  printf ":: jobs running: %s\n" "$(echo `jobs -p`)"
}

## sample input data
## is redirected to filehandle=3
exec 3<<-EOF
www.google.com google.html
www.hotmail.com hotmail.html
www.wikipedia.org wiki.html
www.cisco.com cisco.html
www.cnn.com cnn.html
www.yahoo.com yahoo.html
EOF

## read data from filehandle=3, line by line
while IFS=' ' read -u 3 -r u f || [[ -n "$f" ]]; do
  [[ -z "$f" ]] && continue                  ## ignore empty input line
  while [[ $(jobs -p|wc -l) -ge "$m" ]]; do  ## while $m or more jobs in running
    _debug                                   ## then list jobs to see (debug)
    wait -n                                  ## and wait for some job(s) to finish
  done
  curl --retry $t -Ls "$u" >"$f" &           ## download in background
  printf "job %d: %s => %s\n" $! "$u" "$f"   ## print job info to see (debug)
done

_debug; wait; ls -l *\.html                  ## see final results

Outputs:

job 22992: www.google.com => google.html
job 22996: www.hotmail.com => hotmail.html
job 23000: www.wikipedia.org => wiki.html
:: jobs running: 22992 22996 23000
job 23022: www.cisco.com => cisco.html
:: jobs running: 22996 23000 23022
job 23034: www.cnn.com => cnn.html
:: jobs running: 23000 23022 23034
job 23052: www.yahoo.com => yahoo.html
:: jobs running: 23000 23034 23052
-rw-r--r-- 1 xxx xxx  61473 Jan 21 01:15 cisco.html
-rw-r--r-- 1 xxx xxx 155055 Jan 21 01:15 cnn.html
-rw-r--r-- 1 xxx xxx  12514 Jan 21 01:15 google.html
-rw-r--r-- 1 xxx xxx   3566 Jan 21 01:15 hotmail.html
-rw-r--r-- 1 xxx xxx  74711 Jan 21 01:15 wiki.html
-rw-r--r-- 1 xxx xxx 319967 Jan 21 01:15 yahoo.html
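
For the roughly four thousand URLs in the updated question, the heredoc could be replaced by a plain input file. A minimal sketch of the same wait -n technique (bash 4.3+ is assumed), reading a hypothetical urls.txt with one "url output-file" pair per line and using wget as in the question:

#!/bin/bash

m=10                                         ## max jobs (downloads) at a time

while IFS=' ' read -r u f || [[ -n "$f" ]]; do
  [[ -z "$f" ]] && continue                  ## ignore empty input line
  while [[ $(jobs -p|wc -l) -ge "$m" ]]; do  ## while $m or more jobs are running
    wait -n                                  ## wait for some job(s) to finish
  done
  wget -q -O "$f" "$u" &                     ## download in background
done <urls.txt

wait                                         ## wait for the remaining jobs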

After reading your updated question, I think it is much easier to use lftp, which can log and download (automatically following links, retrying, and continuing downloads); you'll never need to worry about job/fork resources because you run only a few lftp commands. Just split your download list into a few smaller lists, and lftp will do the downloading for you:

$ cat downthemall.sh 
#!/bin/bash

## run: lftp -c 'help get'
## to know how to use lftp to download files
## with automatically retry+continue

p=()                                     ## pid list

for l in *\.lst; do
  lftp -f "$l" >/dev/null &              ## run processes in parallel
  p+=("--pid=$!")                        ## record pid
done

until [[ -f d.log ]]; do sleep 0.5; done ## wait for the log file
tail -f d.log ${p[@]}                    ## print results when downloading

Outputs:

$ cat 1.lst 
set xfer:log true
set xfer:log-file d.log
get -c http://www.microsoft.com -o micro.html
get -c http://www.cisco.com     -o cisco.html
get -c http://www.wikipedia.org -o wiki.html

$ cat 2.lst 
set xfer:log true
set xfer:log-file d.log
get -c http://www.google.com    -o google.html
get -c http://www.cnn.com       -o cnn.html
get -c http://www.yahoo.com     -o yahoo.html

$ cat 3.lst 
set xfer:log true
set xfer:log-file d.log
get -c http://www.hp.com        -o hp.html
get -c http://www.ibm.com       -o ibm.html
get -c http://stackoverflow.com -o stack.html

$  rm *log *html;./downthemall.sh
2018-01-22 02:10:13 http://www.google.com.vn/?gfe_rd=cr&dcr=0&ei=leVkWqiOKfLs8AeBvqBA -> /tmp/1/google.html 0-12538 103.1 KiB/s
2018-01-22 02:10:13 http://edition.cnn.com/ -> /tmp/1/cnn.html 0-153601 362.6 KiB/s
2018-01-22 02:10:13 https://www.microsoft.com/vi-vn/ -> /tmp/1/micro.html 0-129791 204.0 KiB/s
2018-01-22 02:10:14 https://www.cisco.com/ -> /tmp/1/cisco.html 0-61473 328.0 KiB/s
2018-01-22 02:10:14 http://www8.hp.com/vn/en/home.html -> /tmp/1/hp.html 0-73136 92.2 KiB/s
2018-01-22 02:10:14 https://www.ibm.com/us-en/ -> /tmp/1/ibm.html 0-32700 131.4 KiB/s
2018-01-22 02:10:15 https://vn.yahoo.com/?p=us -> /tmp/1/yahoo.html 0-318657 208.4 KiB/s
2018-01-22 02:10:15 https://www.wikipedia.org/ -> /tmp/1/wiki.html 0-74711 60.7 KiB/s
2018-01-22 02:10:16 https://stackoverflow.com/ -> /tmp/1/stack.html 0-253033 180.8
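
The .lst files above were written by hand; for a few thousand URLs they could be generated from a plain list instead. A rough sketch, assuming a hypothetical urls.txt with one URL per line and GNU split:

#!/bin/bash

## split urls.txt into chunks of 500 URLs each: part00, part01, ... (GNU split)
split -d -l 500 urls.txt part

n=0
for c in part*; do
  {
    printf 'set xfer:log true\nset xfer:log-file d.log\n'
    ## turn each URL into an lftp "get -c" command;
    ## the output file is named after the chunk and line number
    awk -v n="$n" '{printf "get -c %s -o file-%d-%d.html\n", $1, n, NR}' "$c"
  } >"$n.lst"
  ((n++))
done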

Upvotes: 4

iamauser

Reputation: 11469

With the updated question, here is an updated answer.

The following script launches 10 wget processes (can be changed to any number) in the background and monitors them. Once one of the processes finishes, it starts the next one from the list and tries to keep the same $maxn (10) processes running in the background until it runs out of URLs from the list ($urlfile). There are inline comments to help you follow along.

$ cat wget.sh
#!/bin/bash

wget_bg()
{
    > ./wget.pids # Start with empty pidfile
    urlfile="$1"
    maxn=$2
    cnt=0;
    while read -r url
    do
        if [ $cnt -lt $maxn ] && [ ! -z "$url" ]; then # Only maxn processes will run in the background
            echo -n "wget $url ..."
            wget "$url" &>/dev/null &
            pidwget=$! # This gets the backgrounded pid
            echo "$pidwget" >> ./wget.pids # fill pidfile
            echo "pid[$pidwget]"
            ((cnt++));
        fi
        while [ $cnt -eq $maxn ] # Start monitoring as soon the maxn process hits
        do
            while read -r pids
            do
                if ps -p $pids > /dev/null; then # Check pid running
                  :
                else
                  sed -i "/$pids/d" wget.pids # If not remove it from pidfile
                  ((cnt--)); # decrement counter
                fi
            done < wget.pids
        done
    done < "$urlfile"
}    
# This runs 10 wget processes at a time in the bg. Modify for more or less.
wget_bg ./test.txt 10 

To run:

$ chmod u+x ./wget.sh 
$ ./wget.sh
wget blah.com ...pid[13012]
wget whatever.com ...pid[13013]
wget thing.com ...pid[13014]
wget foo.com ...pid[13015]
wget bar.com ...pid[13016]
wget baz.com ...pid[13017]
wget steve.com ...pid[13018]
wget kendal.com ...pid[13019]
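
The input list (./test.txt above) is assumed to be a plain text file with one URL per line, for example:

$ cat test.txt
http://blah.com
http://whatever.com
http://thing.com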

Upvotes: 2

Léo R.

Reputation: 2698

Add this in your if statement:

until wget -b "$item"; do
    printf '.'
    sleep 2
done

The loop will wait until the process has finished and print a "." every 2 seconds.

Upvotes: -2
