CM.

Reputation: 680

Multiple URL existence check

I have around 100k URLs in a database and I want to check that they are all valid. I tried PHP with curl, but it is very slow and hits the script timeout. Is there a better way to do this, for example with a shell script?

So far I have tried this:

// By default get_headers uses a GET request to fetch the headers. If you
// want to send a HEAD request instead, you can do so using a stream context:
stream_context_set_default(
    array(
        'http' => array(
            'method' => 'HEAD'
        )
    )
);
$headers = get_headers('http://example.com');

It's running in a for loop over all the URLs.

Upvotes: 1

Views: 435

Answers (3)

ggerman

Reputation: 19

My bash solution:

#!/bin/bash

###############################################
# mailto: [email protected]
# checkurls
# https://github.com/ggerman/checkurls
# require curl
###############################################

# read the CSV and keep only the URL column
url() {
  cat urls.csv |
  replace |
  show
}

# turn the commas into spaces
replace() {
  tr ',' ' '
}

# print only the first field (the URL)
show() {
  awk '{print $1}'
}

url | \
while read -r CMD; do
  echo "$CMD"
  curl -Is "$CMD" | head -n 1   # print just the HTTP status line
done

Upvotes: 0

Mark Setchell

Reputation: 207465

There is a lot of latency in servers replying, so this problem lends itself to parallelising. Try splitting the list into a number of sublists and running scripts in parallel, each one processing a different list.

Try looking at the split command to generate the lists.

So, you will get something like this:

#!/bin/bash
split -l 1000 urllist.txt tmpurl       # split bigfile into 1000 line subfiles called tmpurl*
for p in tmpurl*                       # for all tmpurl* files
do
   # Start a process to check the URLs in that list
   echo start checking file $p in background &    
done
wait                                   # till all are finished

Where I have put "start checking file $p in background", you would supply a simple PHP or shell script that takes a filename as a parameter (or reads from its stdin) and checks each URL in that file in a loop, however you are already doing it.
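For illustration only, here is a minimal sketch of what such a worker could look like as a shell script; the file name checkfile.sh and the particular curl options are my own assumptions, not part of the answer above. It reads one URL per line from the file given as its first argument and prints each URL with the HTTP status code it returns:

#!/bin/bash
# Hypothetical worker script (checkfile.sh): check every URL listed,
# one per line, in the file passed as the first argument.
while read -r url; do
    # HEAD request, discard headers, keep only the numeric status code
    code=$(curl -I -s -o /dev/null -w '%{http_code}' "$url")
    echo "$url $code"
done < "$1"

The placeholder line in the loop above would then become something like ./checkfile.sh "$p" > "$p.result" & so that each sub-list is checked by its own background process.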

Extra Information:

Just for fun, I made a list of 1,000 URLs and curled the headers from each of them with curl -I -s. In the sequential case, it took 4 minutes 19 seconds. When I used the above script to split the 1,000 URLs into sub-lists of 100 per file and started 10 processes, the entire test took 22 seconds - roughly a 12x speedup. Splitting the list into sub-lists of 50 URLs gave 20 processes that all completed in 14 seconds. So, as I said, the problem is readily parallelisable.

Upvotes: 1

Ribson

Reputation: 15

You can use the mechanize Python module to visit the websites and read the responses.

Upvotes: 0
