JSoup Parse text and links in sequence from html file

Question

I am trying to extract the text and links from an HTML file. At the moment I can extract both easily using JSoup, but I can only do it separately.

Here is my code:

try {
    doc = (Document) Jsoup.parse(new File(input), "UTF-8");
    Elements paragraphs = ((Element) doc).select("td.text");

    for (Element p : paragraphs) {
        // System.out.println(p.text() + "\n" + "***********************************************************" + "\n");
        getGui().setTextVers(p.text() + "\n" + "***********************************************************" + "\n");
    }
    Elements links = doc.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
        getGui().setTextVers("\n\n" + link.text() + ">\n" + linkHref + "\n");
    }
}

I have placed a .text class on the outermost td wherever there is text. What I would like to achieve is this: when the program finds a td with the .text class, it checks that td for any links and extracts them from that section in order. So you would have:

Text

Link

Text

Link

I tried putting an inner for-each loop inside the first for-each loop, but this only printed the full list of links for the page. Can anyone help?

Syam S · Accepted Answer

Try looking up the links on each td.text element instead of on the whole document:

Document doc = Jsoup.parse(new File(input), "UTF-8");
Elements paragraphs = doc.select("td.text");

for (Element p : paragraphs) {
    System.out.println(p.text());
    // Search for links inside this cell only, not in the whole document
    Elements links = p.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
        System.out.println("\n\n" + linkText + ">\n" + linkHref + "\n");
    }
}
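
If the text and links inside a single cell also need to come out interleaved in document order (Text, Link, Text, Link), one possible approach is to walk each cell's nodes with jsoup's traverse and NodeVisitor. The following is only a sketch under assumptions: the class name TextAndLinksInOrder, the helper extract, the "TEXT:"/"LINK:" prefixes and the sample markup are all made up for illustration and are not part of the original post.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

public class TextAndLinksInOrder {

    // Hypothetical helper (not from the original post): collects text runs and
    // links from every td.text cell in the order they appear in the document.
    static List<String> extract(Document doc) {
        final List<String> out = new ArrayList<>();
        for (Element cell : doc.select("td.text")) {
            cell.traverse(new NodeVisitor() {
                private int anchorDepth = 0; // greater than 0 while inside an <a> element

                @Override
                public void head(Node node, int depth) {
                    if (node instanceof Element && "a".equals(((Element) node).tagName())) {
                        Element link = (Element) node;
                        out.add("LINK: " + link.text() + " > " + link.attr("href"));
                        anchorDepth++;
                    } else if (node instanceof TextNode && anchorDepth == 0) {
                        // Text outside any link; link text is reported with the link itself
                        String text = ((TextNode) node).text().trim();
                        if (!text.isEmpty()) {
                            out.add("TEXT: " + text);
                        }
                    }
                }

                @Override
                public void tail(Node node, int depth) {
                    if (node instanceof Element && "a".equals(((Element) node).tagName())) {
                        anchorDepth--;
                    }
                }
            });
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample markup; a real file would be loaded with
        // Jsoup.parse(new File(input), "UTF-8") as in the answer above.
        String html = "<table><tr><td class=\"text\">First text "
                + "<a href=\"http://example.com/1\">one</a> second text "
                + "<a href=\"http://example.com/2\">two</a></td></tr></table>";
        for (String line : extract(Jsoup.parse(html))) {
            System.out.println(line);
        }
    }
}
```

With the sample markup this should list "First text", the first link, "second text", then the second link, in that order. The sketch assumes td.text cells are not nested inside one another; if they were, the inner cell's content would be reported twice.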
