Reputation: 28365
I would like to scan some websites looking for broken links, preferably using Java. Any hints on how I can start doing this?
(I know there are some websites that do this, but I want to make my own personalized log file)
Upvotes: 2
Views: 3915
Reputation: 5895
Writing a web crawler isn't as simple as just reading the static HTML; if the page uses JavaScript to modify the DOM, it gets complex. You will also need to keep track of pages you've already visited to avoid spider traps. If the site is pure static HTML, then go for it, but if the site uses jQuery and is large, expect it to be complex.
If your site is all static, small, and has little or no JS, then use the answers already listed.
Or
You could use Heritrix and then later parse its crawl.log for 404s. See the Heritrix documentation on crawl.log.
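As a minimal sketch of that post-processing step, assuming the default crawl.log layout (second whitespace-separated field is the fetch status code, fourth field is the fetched URI) and a placeholder path to the log:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class CrawlLogFilter {
    public static void main(String[] args) throws IOException {
        // "crawl.log" is a placeholder; point it at your Heritrix job's logs directory.
        try (Stream<String> lines = Files.lines(Paths.get("crawl.log"))) {
            lines.map(line -> line.trim().split("\\s+"))
                 // Assumed layout: field 1 is the fetch status code, field 3 the fetched URI.
                 .filter(f -> f.length > 3 && "404".equals(f[1]))
                 .forEach(f -> System.out.println("404: " + f[3]));
        }
    }
}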
Or, if you must write your own:
You could use something like HtmlUnit (it has a JavaScript engine) to load the page, then query the DOM for links. Place each link in an "unvisited" queue, then pull links from the unvisited queue to get your next URL to load; if a page fails to load, report it.
To avoid duplicate pages (spider traps), you could hash each link and keep a hash table of visited pages (see CityHash). Before placing a link into the unvisited queue, check it against the visited hash table.
To avoid leaving your site, check that the URL is in a safe-domain list before adding it to the unvisited queue. If you want to confirm that the off-domain links are good, keep them in an offDomain queue, then later load each link from that queue using URL.getContent() to see if it works (this is faster than using HtmlUnit, and you don't need to parse the page anyway).
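A minimal sketch of that crawl loop, assuming HtmlUnit's WebClient/HtmlAnchor API, a hypothetical start domain, and a plain HashSet for the visited check instead of CityHash:
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class SiteCrawler {
    public static void main(String[] args) throws Exception {
        String domain = "http://www.example.com";    // hypothetical site to check
        Queue<String> unvisited = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        unvisited.add(domain + "/");

        WebClient client = new WebClient();
        while (!unvisited.isEmpty()) {
            String url = unvisited.poll();
            if (!visited.add(url)) {
                continue;                            // already crawled: spider-trap guard
            }
            HtmlPage page;
            try {
                page = client.getPage(url);          // JS runs, so dynamically added links are seen too
            } catch (Exception e) {
                // By default WebClient throws on 404/500, so this is where broken links show up.
                System.out.println("BROKEN: " + url + " (" + e.getMessage() + ")");
                continue;
            }
            for (HtmlAnchor anchor : page.getAnchors()) {
                String href;
                try {
                    href = page.getFullyQualifiedUrl(anchor.getHrefAttribute()).toString();
                } catch (Exception e) {
                    continue;                        // skip javascript:, malformed hrefs, etc.
                }
                if (href.startsWith(domain) && !visited.contains(href)) {
                    unvisited.add(href);             // stay inside the safe-domain list
                }
            }
        }
    }
}
Off-domain links filtered out above could instead be pushed onto a separate offDomain queue and checked with a plain URL fetch, as described in the paragraph before this sketch.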
Upvotes: 4
Reputation: 24722
Read the HTML of each page, find every <a> tag, get its content and attempt to connect to it. If necessary, repeat recursively if the URL from the <a> belongs to your site. Make sure to store URLs that you have processed already in a map so you don't do it more than once.
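The "attempt to connect" part can be done with a plain HttpURLConnection; a small sketch (the checkLink helper name and timeouts are my additions):
import java.net.HttpURLConnection;
import java.net.URL;

public class LinkChecker {
    // Returns true if the URL answers with a non-error HTTP status.
    static boolean checkLink(String link) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(link).openConnection();
            conn.setRequestMethod("HEAD");     // we only need the status, not the body
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            return conn.getResponseCode() < 400;   // treat 4xx/5xx as broken
        } catch (Exception e) {
            return false;                      // unreachable host, malformed URL, timeout...
        }
    }

    public static void main(String[] args) {
        System.out.println(checkLink("http://www.example.com/"));
    }
}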
Upvotes: 0
Reputation: 4073
Write a function which recursively checks links. Pseudocode:
function checkLinks(String url) {
    try {
        content = HTTP.getContents(url);
        String[] links = content.getAllRegexMatches('href="(http://.*?)"');
        foreach (links as String link)
            checkLinks(link);
    } catch (Exception e) {
        System.out.println("Link " + url + " failed");
    }
}
Depending on the links, you may have to complete the link passed to the next recursion by resolving it relative to the current URL, as in the sketch below.
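A runnable sketch of that idea in plain Java; the regex-based extraction, the visited set (needed so the recursion terminates), and the relative-URL resolution via the URL(context, spec) constructor are my assumptions, and a real crawler would use an HTML parser instead of a regex:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class RecursiveLinkChecker {
    private static final Pattern HREF = Pattern.compile("href=\"(.*?)\"");
    private static final Set<String> visited = new HashSet<>();

    static void checkLinks(String url) {
        if (!visited.add(url)) {
            return;                                    // already checked: stop the recursion
        }
        String content;
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()))) {
            content = in.lines().collect(Collectors.joining("\n"));
        } catch (Exception e) {
            System.out.println("Link " + url + " failed");
            return;
        }
        Matcher m = HREF.matcher(content);
        while (m.find()) {
            try {
                // Resolve relative links such as "/about.html" against the current URL.
                String next = new URL(new URL(url), m.group(1)).toString();
                checkLinks(next);
            } catch (Exception ignored) {
                // e.g. javascript: hrefs that java.net.URL cannot parse
            }
        }
    }

    public static void main(String[] args) {
        checkLinks("http://www.example.com/");         // hypothetical start page
    }
}
Note that, like the pseudocode, this follows off-site links as well; add a domain check before recursing if you only want to crawl your own site.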
Upvotes: 0