Reputation: 739
I want to write a small piece of code that will extract the "Kategorie" (category) from an href with jsoup.
<a href="/wiki/Kategorie:Herrscher_des_Mittelalters" title="Kategorie:Herrscher des Mittelalters">Herrscher des Mittelalters</a>
In this case I am searching for Herrscher des Mittelalters.
My code reads the first line of a .txt file with a BufferedReader.
BufferedReader r = new BufferedReader(new InputStreamReader(new FileInputStream(new File(FilePath)), Charset.forName("UTF-8")));
Document doc = Jsoup.parse(r.readLine());
Element elem = doc;
I know there are methods to get the href link, but I don't know of any that search for elements within the href link.
Any suggestions?
Additional information: My .txt file contains full Wikipedia HTML pages.
Upvotes: 2
Views: 725
Reputation: 17701
This should get you all titles from links. You can split the titles further as you need:
Document d = Jsoup.parse("<a href=\"/wiki/Kategorie:Herrscher_des_Mittelalters\" title=\"Kategorie:Herrscher des Mittelalters\">Herrscher des Mittelalters</a>");
Elements links = d.select("a");
Set<String> categories = new HashSet<>();
for (Element link : links) {
    // attr() returns an empty string when the attribute is missing
    String title = link.attr("title");
    if (!title.isEmpty()) {
        categories.add(title);
    }
}
System.out.println(categories);
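To illustrate the "split the titles further" step, here is a minimal sketch in plain Java. It assumes every category title starts with the literal prefix "Kategorie:", as in the example link from the question. The class name CategoryTitle and the method categoryName are made up for this illustration and are not part of jsoup.

```java
import java.util.Optional;

public class CategoryTitle {
    // Assumed prefix, taken from the title attribute shown in the question.
    private static final String PREFIX = "Kategorie:";

    // Returns the category name if the title starts with the prefix,
    // otherwise an empty Optional (for example, for ordinary article links).
    public static Optional<String> categoryName(String title) {
        if (title == null || !title.startsWith(PREFIX)) {
            return Optional.empty();
        }
        return Optional.of(title.substring(PREFIX.length()).trim());
    }

    public static void main(String[] args) {
        System.out.println(categoryName("Kategorie:Herrscher des Mittelalters").orElse("(no category)"));
        System.out.println(categoryName("Hauptseite").orElse("(no category)"));
    }
}
```

Inside the loop above you could then add only the titles for which categoryName(title) is present, so links without the prefix are skipped.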
Upvotes: 1
Reputation: 22422
You can use the getElementsContainingText() method (inherited by org.jsoup.nodes.Document) to search for elements that contain a given text.
Elements elements = doc.getElementsContainingText("Herrscher des Mittelalters");
for (Element element : elements) {
    System.out.println(element.text());
}
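Note that getElementsContainingText() also matches every ancestor whose descendants contain the text (html, body and so on), and it requires knowing the category name in advance. As an alternative sketch, assuming all category links have an href that starts with /wiki/Kategorie: as in the question, an attribute-prefix selector picks out just those links. The class name CategoryLinks and the method categories are invented for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.LinkedHashSet;
import java.util.Set;

public class CategoryLinks {
    // Collects the link text of every <a> whose href starts with /wiki/Kategorie:
    public static Set<String> categories(String html) {
        Document doc = Jsoup.parse(html);
        Set<String> result = new LinkedHashSet<>();
        // [attr^=prefix] matches elements whose attribute value starts with the prefix
        for (Element link : doc.select("a[href^=/wiki/Kategorie:]")) {
            result.add(link.text());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/wiki/Kategorie:Herrscher_des_Mittelalters\" "
                + "title=\"Kategorie:Herrscher des Mittelalters\">Herrscher des Mittelalters</a>"
                + "<a href=\"/wiki/Hauptseite\" title=\"Hauptseite\">Hauptseite</a>";
        System.out.println(categories(html));
    }
}
```

Since the .txt file holds full HTML pages, parsing only the first line with readLine() will probably miss most links. Passing the whole file, for example via Jsoup.parse(new File(FilePath), "UTF-8"), is likely closer to what is needed.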
Upvotes: 0