Reputation: 2515
How can I use Jsoup to extract the specification data from this website separately for each row, e.g. Network -> Network Type, Battery, etc.?
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class mobilereviews {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://mobilereviews.net/details-for-Motorola%20L7.htm").get();
        for (Element table : doc.select("table")) {
            for (Element row : table.select("tr")) {
                Elements tds = row.select("td");
                System.out.println(tds.get(0).text());
            }
        }
    }
}
Upvotes: 3
Views: 28947
Reputation: 334
Here is a generic solution for extracting a table from an HTML page with Jsoup.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ExtractTableDataUsingJSoup {
    public static void main(String[] args) {
        extractTableUsingJsoup("http://mobilereviews.net/details-for-Motorola%20L7.htm", "phone_details");
    }

    public static void extractTableUsingJsoup(String url, String tableId) {
        Document doc;
        try {
            // The URL must include the protocol (http:// or https://).
            doc = Jsoup.connect(url).get();
            // Pass the id of any table on any page and the code below prints its contents.
            // Store the extracted data in appropriate data structures for further processing.
            Element table = doc.getElementById(tableId);
            if (table == null) {
                // getElementById returns null when no element has this id.
                System.out.println("No element with id " + tableId);
                return;
            }
            Elements tds = table.getElementsByTag("td");
            // If the table contains nested tables, their cells are included here too.
            for (Element td : tds) {
                System.out.println("\n" + td.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
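The comment about storing the extracted data in a data structure could look roughly like the sketch below. It is only an illustration: the class name TableToRows, the method extractRows, and the inline sample markup are made up for this example, and the HTML is parsed from a string so that it runs without a network connection.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TableToRows {
    // Made-up sample markup standing in for the real page.
    static final String SAMPLE_HTML = "<table id='phone_details'>"
            + "<tr><th>Spec</th><th>Value</th></tr>"
            + "<tr><td>Network Type</td><td>GSM</td></tr>"
            + "<tr><td>Battery</td><td>Li-Ion</td></tr>"
            + "</table>";

    // Returns one list of cell texts per row; empty if the table id is not found.
    static List<List<String>> extractRows(Document doc, String tableId) {
        List<List<String>> rows = new ArrayList<>();
        Element table = doc.getElementById(tableId);
        if (table == null) {
            return rows;
        }
        for (Element row : table.select("tr")) {
            List<String> cells = new ArrayList<>();
            for (Element cell : row.select("td")) {
                cells.add(cell.text());
            }
            // Rows without any td (e.g. header rows made of th) are skipped.
            if (!cells.isEmpty()) {
                rows.add(cells);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        for (List<String> row : extractRows(doc, "phone_details")) {
            System.out.println(String.join(" -> ", row));
        }
    }
}
```

Keeping each row as its own list preserves the row boundaries that are lost when every td is printed in one flat loop.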
Upvotes: 1
Reputation: 428
This is how I get the data from an HTML table.
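// Assumes doc is an already parsed org.jsoup.nodes.Document.
// cadena accumulates the output; it is declared here so the snippet is complete.
String cadena = "";
org.jsoup.nodes.Element tablaRegistros = doc.getElementById("tableId");
for (org.jsoup.nodes.Element row : tablaRegistros.select("tr")) {
    for (org.jsoup.nodes.Element column : row.select("td")) {
        // Alternative: read only the first two cells of each row.
        // Elements tds = row.select("td");
        // cadena += tds.get(0).text() + "->" + tds.get(1).text() + " \n";
        cadena += column.text() + ",";
    }
    cadena += "\n";
}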
Upvotes: 1
Reputation: 11185
XPath for the columns: //*[@id="phone_details"]/tbody/tr[3]/td[2]/strong
XPath for the values: //*[@id="phone_details"]/tbody/tr[3]/td[3]
@Joey's code tries to zero in on these. You should be able to write the select() rules based on the XPath expressions.
Replace the numbers (tr[N] / td[N]) with the appropriate row and column indices.
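As a rough sketch of how those XPath expressions might translate into select() rules: an XPath step such as tr[3] corresponds to the CSS pseudo-class tr:nth-of-type(3), and "/" corresponds to the child combinator ">". The class name, helper methods, and sample markup below are made up for illustration; the real page may be structured differently.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class XPathToSelector {
    // Made-up sample markup; the values are placeholders, not real data.
    static final String SAMPLE_HTML = "<table id='phone_details'><tbody>"
            + "<tr><td colspan='3'>Title</td></tr>"
            + "<tr><td colspan='3'>Specifications</td></tr>"
            + "<tr><td rowspan='2'>Network</td>"
            + "<td><strong>Network Type</strong></td><td>GSM</td></tr>"
            + "<tr><td><strong>Announced</strong></td><td>2005</td></tr>"
            + "</tbody></table>";

    // CSS counterpart of //*[@id="phone_details"]/tbody/tr[row]/td[col]
    static String cellSelector(int row, int col) {
        return "#phone_details > tbody > tr:nth-of-type(" + row + ")"
                + " > td:nth-of-type(" + col + ")";
    }

    // Text of the first match, or an empty string if nothing matches.
    static String textAt(Document doc, String cssQuery) {
        Element e = doc.selectFirst(cssQuery);
        return e == null ? "" : e.text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        String label = textAt(doc, cellSelector(3, 2) + " > strong");
        String value = textAt(doc, cellSelector(3, 3));
        System.out.println(label + " -> " + value);
    }
}
```

Looping row from 3 up to the number of rows would visit every specification line, although rows that start a new group carry an extra rowspan cell, so the column numbers can shift from row to row.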
Alternatively, you can pipe the HTML through a text-only browser and extract the data from the text. Here is the text version of the page. You can split the text on a delimiter or read from a fixed character offset to extract the data.
Upvotes: 3
Reputation: 1349
Here is an attempt to find the solution to your problem:
Document doc = Jsoup.connect("http://mobilereviews.net/details-for-Motorola%20L7.htm").get();
for (Element table : doc.select("table[id=phone_details]")) {
    for (Element row : table.select("tr:gt(2)")) {
        Elements tds = row.select("td:not([rowspan])");
        System.out.println(tds.get(0).text() + "->" + tds.get(1).text());
    }
}
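Parsing HTML is tricky, and if the page's markup changes, your code needs to change as well.
You need to study the HTML markup first to come up with your parsing rules. The selectors used above are:
table[id=phone_details] selects the table whose id is phone_details
tr:gt(2) selects the rows whose sibling index is greater than 2, i.e. it skips the first three rows
td:not([rowspan]) selects the cells that have no rowspan attribute
For more complex options in the selector syntax, see http://jsoup.org/cookbook/extracting-data/selector-syntax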
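To try the three selectors without hitting the site, a self-contained sketch along these lines may help. It collects each label and value into a map and skips rows with fewer than two matching cells instead of throwing. The class name SpecsToMap, the method extractSpecs, and the sample markup are assumptions made for this example only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SpecsToMap {
    // Made-up sample markup: three leading rows, then the data rows.
    static final String SAMPLE_HTML = "<table id='phone_details'><tbody>"
            + "<tr><td colspan='3'>Title</td></tr>"
            + "<tr><td colspan='3'>Spacer</td></tr>"
            + "<tr><td colspan='3'>Specifications</td></tr>"
            + "<tr><td rowspan='2'>Network</td><td>Network Type</td><td>GSM</td></tr>"
            + "<tr><td>Announced</td><td>2005</td></tr>"
            + "<tr><td rowspan='1'>Battery</td><td>Type</td><td>Li-Ion</td></tr>"
            + "</tbody></table>";

    // Maps each label cell to its value cell, in document order.
    static Map<String, String> extractSpecs(Document doc) {
        Map<String, String> specs = new LinkedHashMap<>();
        for (Element row : doc.select("table[id=phone_details] tr:gt(2)")) {
            // Group cells carry a rowspan attribute and are filtered out here.
            Elements tds = row.select("td:not([rowspan])");
            if (tds.size() >= 2) {
                specs.put(tds.get(0).text(), tds.get(1).text());
            }
        }
        return specs;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        for (Map.Entry<String, String> e : extractSpecs(doc).entrySet()) {
            System.out.println(e.getKey() + "->" + e.getValue());
        }
    }
}
```

A LinkedHashMap keeps the rows in page order. If two groups reuse the same label, the later value overwrites the earlier one, so prefixing the label with the group name may be worth considering.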
Upvotes: 6