Reputation: 53
I am trying to extract the text and links from an HTML file. At the moment I can extract both easily using JSoup, but I can only do it separately.
Here is my code:
try {
    doc = Jsoup.parse(new File(input), "UTF-8");
    // Text of every td that carries the "text" class
    Elements paragraphs = doc.select("td.text");
    for (Element p : paragraphs) {
        getGui().setTextVers(p.text() + "\r\n" + "***********************************************************" + "\r\n");
    }
    // Every link in the whole document, not just those inside td.text
    Elements links = doc.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
        getGui().setTextVers("\n\n" + linkText + ">\r\n" + linkHref + "\r\n");
    }
} catch (IOException e) {
    // Jsoup.parse(File, String) throws IOException if the file cannot be read
    e.printStackTrace();
}
I have placed a .text class on the outermost td that contains text. What I would like to achieve is this: when the program finds a td with the .text class, it checks it for any links and extracts them from that section in order. So you would have:
Text
Link
Text
Link
I tried putting an inner for-each loop inside the first for-each loop, but this only printed the full list of links for the page. Can anyone help?
Upvotes: 0
Views: 615
Reputation: 8499
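Try calling getElementsByTag on each td.text element (p) rather than on doc, so that only the links inside that cell are returned: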
Document doc = Jsoup.parse(new File(input), "UTF-8");
Elements paragraphs = doc.select("td.text");
for (Element p : paragraphs) {
    System.out.println(p.text());
    // Search only within this td, not the whole document
    Elements links = p.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
        System.out.println("\n\n" + linkText + ">\r\n" + linkHref + "\r\n");
    }
}
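Note that the loop above prints all of a cell's text first and then all of its links. If the text and links need to be interleaved in document order (Text, Link, Text, Link), one possible approach is to walk the cell's child nodes with jsoup's NodeVisitor. The following is only a sketch: the class name InterleavedExtractor, the helpers extractInOrder and insideLink, the "TEXT:"/"LINK:" prefixes, and the sample HTML are all made up for illustration and are not part of the original code.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

import java.util.ArrayList;
import java.util.List;

public class InterleavedExtractor {

    // Hypothetical helper: collects the text runs and links of one element
    // in the order they appear in the document.
    static List<String> extractInOrder(Element cell) {
        List<String> out = new ArrayList<>();
        cell.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof Element && ((Element) node).tagName().equals("a")) {
                    Element link = (Element) node;
                    out.add("LINK: " + link.text() + " > " + link.attr("href"));
                } else if (node instanceof TextNode && !insideLink(node)) {
                    // Skip text that belongs to a link; it was already emitted above
                    String text = ((TextNode) node).text().trim();
                    if (!text.isEmpty()) {
                        out.add("TEXT: " + text);
                    }
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node
            }
        });
        return out;
    }

    // Hypothetical helper: true if any ancestor of the node is an <a> element.
    static boolean insideLink(Node node) {
        for (Node p = node.parent(); p != null; p = p.parent()) {
            if (p instanceof Element && ((Element) p).tagName().equals("a")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Sample markup, assumed for illustration only
        String html = "<table><tr><td class=\"text\">First "
                + "<a href=\"http://a.example\">A</a> second "
                + "<a href=\"http://b.example\">B</a></td></tr></table>";
        Document doc = Jsoup.parse(html);
        for (Element cell : doc.select("td.text")) {
            for (String line : extractInOrder(cell)) {
                System.out.println(line);
            }
        }
    }
}
```

For the sample markup this should list "First", the link A, "second", and the link B, in that order. In the original program the println call could be swapped for getGui().setTextVers(...). If td.text cells are nested inside one another, the inner cell's content would be visited more than once, so the selector may need narrowing in that case.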
Upvotes: 1