Reputation: 249
I'm implementing a web robot that has to get all the links from a page and select the ones I need. I got it all working except for a problem where a link is inside a "table" or a "span" tag. Here's my code snippet:
Document doc = Jsoup.connect(url)
.timeout(TIMEOUT * 1000)
.get();
Elements elts = doc.getElementsByTag("a");
And here's the example HTML:
<table>
<tr><td><a href="www.example.com"></a></td></tr>
</table>
My code will not fetch such links. Using doc.select doesn't help either. My question is, how do I get all the links from the page?
EDIT: I think I know where the problem is. The page I'm having trouble with is very badly written; an HTML validator throws out a tremendous number of errors. Could this cause problems?
Upvotes: 7
Views: 15690
Reputation: 29
Try this code:
String url = "http://test.com";
Document doc = null;
try {
    doc = Jsoup.connect(url).get();
    // a[href] selects every anchor that has an href attribute
    Elements links = doc.select("a[href]");
    for (Element link : links) {
        // abs:href resolves relative URLs against the page URL
        System.out.println("a= " + link.attr("abs:href"));
    }
} catch (IOException e) {
    e.printStackTrace();
}
Upvotes: 0
Reputation: 25340
In general, Jsoup can handle most bad HTML. Dump the HTML as Jsoup sees it (you can simply output doc.toString()).
Tip: use select() instead of getElementsByX(); it's faster and more flexible.
Elements elts = doc.select("a");
(edit)
Here's an overview of the Selector API: http://jsoup.org/cookbook/extracting-data/selector-syntax
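To see that Jsoup's parser repairs broken markup before selection runs, here's a minimal self-contained sketch (the HTML string, base URL, and class name are made up for illustration; it assumes the jsoup library is on the classpath). It parses malformed HTML with a link nested in a table, much like the question's example, and still finds the anchor:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class LinkExtractor {
    public static void main(String[] args) {
        // Deliberately malformed HTML: unclosed table/tr/td tags,
        // one anchor inside a table cell, one inside a span
        String html = "<table><tr><td><a href=\"www.example.com\">link</a>"
                    + "<span><a href=\"http://other.example/\">other</a></span>";

        // The second argument is the base URL used to resolve relative hrefs
        Document doc = Jsoup.parse(html, "http://base.example/");

        // select("a[href]") matches every anchor with an href, however deeply nested
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            // abs:href turns "www.example.com" into an absolute URL
            System.out.println(link.attr("abs:href"));
        }
    }
}
```

Note that "www.example.com" without a scheme is treated as a relative path, so abs:href resolves it against the base URL; that is likely also why the question's example link looks odd.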
Upvotes: 6