Reputation: 680
I have around 100k URLs in a database and I want to check whether all of them are valid. I tried with PHP and curl, but it's very slow and causes script timeouts. Is there a better way to do this, perhaps with a shell script instead?
so far I tried this:
// By default get_headers() uses a GET request to fetch the headers. If you
// want to send a HEAD request instead, you can do so using a stream context:
stream_context_set_default(
    array(
        'http' => array(
            'method' => 'HEAD'
        )
    )
);
$headers = get_headers('http://example.com');
It's running in a for loop.
Upvotes: 1
Views: 435
Reputation: 19
My bash solution:
#!/bin/bash
###############################################
# mailto: [email protected]
# checkurls
# https://github.com/ggerman/checkurls
# requires curl
###############################################

# Extract the first comma-separated field (the URL) from each line of urls.csv
url() {
  cat urls.csv |
  replace      |
  show
}

replace() {
  tr ',' ' '
}

show() {
  awk '{print $1}'
}

url | \
while read -r CMD; do
  echo "$CMD"
  curl -Is "$CMD" | head -n 1   # print only the HTTP status line
done
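If you only want the numeric status code rather than the whole status line, curl can write it out directly with -w '%{http_code}'; a small variation of the final loop above, assuming the same urls.csv input:
url | \
while read -r CMD; do
  # HEAD request; discard the headers and print just "STATUS URL"
  printf '%s %s\n' "$(curl -o /dev/null -s -I -w '%{http_code}' "$CMD")" "$CMD"
done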
Upvotes: 0
Reputation: 207465
There is a lot of latency while waiting for servers to reply, so this problem lends itself to parallelising. Try splitting the list into a number of sublists and running scripts in parallel, each one processing a different list.
Take a look at the split command to generate the lists.
So, you will get something like this:
#!/bin/bash
split -l 1000 urllist.txt tmpurl   # split big file into 1000-line subfiles called tmpurl*

for p in tmpurl*                   # for all tmpurl* files
do
    # Start a process to check the URLs in that list
    echo start checking file "$p" in background &
done

wait                               # till all are finished
Where I have put "start checking file $p in background", you would need to supply a simple PHP or shell script that takes a filename as a parameter (or reads from its stdin) and loops over all the URLs in that file, checking each one however you are already doing it; see the sketch below.
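Such a per-file checker might look like this (a minimal sketch; the name check_urls.sh and the "STATUS URL" output format are placeholders, not part of the answer above):
#!/bin/bash
# check_urls.sh - check every URL listed, one per line, in the file given as $1
# Prints the HTTP status code followed by the URL, e.g. "200 http://example.com"
while read -r url; do
    code=$(curl -o /dev/null -s -I -w '%{http_code}' "$url")
    echo "$code $url"
done < "$1"
The placeholder line in the loop above would then become something like ./check_urls.sh "$p" > "$p.result" &, so that each background process writes its results to its own file.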
Extra Information:
Just for fun, I made a list of 1,000 URLs and fetched the headers from each of them with curl -I -s. In the sequential case, it took 4 minutes 19 seconds. When I used the above script to split the 1,000 URLs into sub-lists of 100 per file and started 10 processes, the entire test took 22 seconds, so a 12x speedup. Splitting the list into sub-lists of 50 URLs resulted in 20 processes that all completed in 14 seconds. So, as I said, the problem is readily parallelisable.
Upvotes: 1
Reputation: 15
You can use the mechanize Python module to visit websites and get the responses from them.
Upvotes: 0