Reputation: 249
I'm implementing a web robot that has to get all the links from a page and select the ones I need. I got it all working except for a problem where a link is inside a "table" or a "span" tag. Here's my code snippet:
Document doc = Jsoup.connect(url)
.timeout(TIMEOUT * 1000)
.get();
Elements elts = doc.getElementsByTag("a");
And here's the example HTML:
<table>
<tr><td><a href="www.example.com"></a></td></tr>
</table>
My code will not fetch such links. Using doc.select doesn't help either. My question is, how do I get all the links from the page?
EDIT: I think I know where the problem is. The page I'm having trouble with is very badly written; an HTML validator throws out a tremendous number of errors. Could this cause problems?
Upvotes: 7
Views: 15690
Reputation: 29
Try this code:
String url = "http://test.com";
Document doc = null;
try {
    doc = Jsoup.connect(url).get();
    // a[href] selects every anchor that has an href attribute
    Elements links = doc.select("a[href]");
    for (Element link : links) {
        // abs:href resolves relative URLs against the page URL
        System.out.println("a= " + link.attr("abs:href"));
    }
} catch (IOException e) {
    e.printStackTrace();
}
Upvotes: 0
Reputation: 25340
In general, Jsoup can handle most bad HTML. Dump the HTML as Jsoup sees it (you can simply output doc.toString()).
Tip: use select() instead of getElementsByX(); it's faster and more flexible.
Elements elts = doc.select("a");
(edit)
Here's an overview of the Selector API: http://jsoup.org/cookbook/extracting-data/selector-syntax
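To see that Jsoup's parser repairs broken markup before selection runs, here's a minimal self-contained sketch (the HTML string, base URL, and class name are made up for illustration; it assumes the jsoup library is on the classpath). It parses malformed HTML with a link nested in a table, much like the question's example, and still finds the anchor:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class LinkExtractor {
    public static void main(String[] args) {
        // Deliberately malformed HTML: unclosed table/tr/td tags,
        // one anchor inside a table cell, one inside a span
        String html = "<table><tr><td><a href=\"www.example.com\">link</a>"
                    + "<span><a href=\"http://other.example/\">other</a></span>";

        // The second argument is the base URL used to resolve relative hrefs
        Document doc = Jsoup.parse(html, "http://base.example/");

        // select("a[href]") matches every anchor with an href, however deeply nested
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            // abs:href turns "www.example.com" into an absolute URL
            System.out.println(link.attr("abs:href"));
        }
    }
}
```

Note that "www.example.com" without a scheme is treated as a relative path, so abs:href resolves it against the base URL; that is likely also why the question's example link looks odd.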
Upvotes: 6