Marcin Krzysiak

Reputation: 249

Jsoup get all links from a page

I'm implementing a web robot that has to get all the links from a page and select the ones I need. I have it all working except for one problem: links that sit inside a "table" or a "span" tag are not picked up. Here's my code snippet:

Document doc = Jsoup.connect(url)
    .timeout(TIMEOUT * 1000)
    .get();
Elements elts = doc.getElementsByTag("a");

And here's the example HTML:

<table>
  <tr><td><a href="www.example.com"></a></td></tr>
</table>

My code will not fetch such links. Using doc.select doesn't help either. My question is: how do I get all the links from the page?

EDIT: I think I know where the problem is. The page I'm having trouble with is very badly written; the HTML validator reports a tremendous number of errors. Could this be causing the problem?

Upvotes: 7

Views: 15690

Answers (2)

Ahmed Abderrahman

Reputation: 29

Try this code:

    String url = "http://test.com";
    try {
        Document doc = Jsoup.connect(url).get();
        // "a[href]" selects every anchor that has an href attribute
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            // "abs:href" resolves the link against the document's base URL
            System.out.println("a = " + link.attr("abs:href"));
        }
    } catch (IOException e) {
        e.printStackTrace();
    }

Upvotes: 0

ollo

Reputation: 25340

In general Jsoup can handle most bad HTML. Dump the HTML as Jsoup sees it (you can simply print doc.toString()) to check what was actually parsed.

Tip: use select() instead of getElementsByX(); it's faster and more flexible.

    Elements elts = doc.select("a");

Here's an overview about the Selector-API: http://jsoup.org/cookbook/extracting-data/selector-syntax
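For reference, here is a minimal self-contained sketch showing that Jsoup's lenient parser finds anchors nested inside tables even in malformed markup (the HTML snippet and base URL below are made up for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class LinkExtractor {
    public static void main(String[] args) {
        // Deliberately sloppy HTML: unclosed tags, link nested in a table cell
        String html = "<table><tr><td><a href='/page'>link</a><td><span>"
                    + "<a href='http://www.example.com'>other</a>";

        // The second argument is the base URI used to resolve relative links
        Document doc = Jsoup.parse(html, "http://test.com/");

        // "a[href]" matches every anchor with an href, at any nesting depth
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            // "abs:href" turns relative URLs into absolute ones
            System.out.println(link.attr("abs:href"));
        }
    }
}
```

If this prints both links but your connect() version does not, the problem is more likely the fetched content (redirects, user-agent blocking, JavaScript-generated links) than the parser choking on bad HTML.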

Upvotes: 6
