Reputation: 37
I am trying to implement a simple HTML web page scraper in Java, and I have a small problem. Suppose I have the following HTML fragment:
<div id="sr-h-left" class="sr-comp">
<a class="link-gray-underline" id="compare_header" rel="nofollow" href="javascript:i18nCompareProd('/serv/main/buyer/ProductCompare.jsp?nxtg=41980a1c051f-0942A6ADCF43B802');">
<span style="cursor: pointer;" class="sr-h-o">Compare</span>
</a>
</div>
<div id="sr-h-right" class="sr-summary">
<div id="sr-num-results">
<div class="sr-h-o-r">Showing 1 - 30 of 1,439 matches,
The data I am interested in is the integer 1,439 in the last line. How can I get that integer out of the HTML? I am considering using a regular expression with java.util.regex.Pattern to extract it, but I am still not clear about the process. I would be grateful for any hints or ideas on this kind of data scraping. Thanks a lot.
Upvotes: 1
Views: 1527
Reputation: 175315
Regular expressions are probably the best way to do it. Something like:
// Requires java.util.regex.Pattern and java.util.regex.Matcher.
Pattern p = Pattern.compile("Showing [0-9,]+ - [0-9,]+ of ([0-9,]+) matches");
Matcher m = p.matcher(scrapedHTML); // scrapedHTML holds the page source as a String
if (m.find()) { // find() searches inside the input; matches() would require the whole input to match
    int num = Integer.parseInt(m.group(1).replaceAll(",", ""));
    // num == 1439
}
I'm not sure what you meant by understanding the "process", but here's what that code does: p is a regular expression pattern that matches the "Showing..." line. m is the result of applying that pattern to the scraped HTML. If m.find() returns true, the pattern was found somewhere in the HTML, and m.group(1) is the first regular expression group (the expression in parentheses) in the pattern, ([0-9,]+), which matches a string of digits and commas, so it will be "1,439". The replaceAll() call turns that into "1439", and Integer.parseInt() turns that into the integer 1439.
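To try the steps above without a live page, here is a minimal self-contained sketch. The class name MatchCountRegex and the method extractTotal are made up for illustration, and the sample string is just the last line of the fragment from the question. The capture group is slightly tightened to require a leading digit, so a run of commas alone cannot reach Integer.parseInt().

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class MatchCountRegex {
    // Captures N in "Showing X - Y of N matches"; N may contain thousands separators.
    private static final Pattern TOTAL =
            Pattern.compile("Showing [0-9,]+ - [0-9,]+ of ([0-9][0-9,]*) matches");

    // Returns the total if the "Showing..." line is present, otherwise an empty OptionalInt.
    static OptionalInt extractTotal(String html) {
        Matcher m = TOTAL.matcher(html);
        if (!m.find()) {
            return OptionalInt.empty();
        }
        // Drop the thousands separators before parsing, e.g. "1,439" becomes "1439".
        return OptionalInt.of(Integer.parseInt(m.group(1).replace(",", "")));
    }

    public static void main(String[] args) {
        String html = "<div class=\"sr-h-o-r\">Showing 1 - 30 of 1,439 matches,";
        System.out.println(extractTotal(html).orElse(-1));
    }
}
```

Returning an OptionalInt rather than a bare int is only one way to signal "no match"; throwing an exception or returning -1 would work as well.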
Upvotes: 2
Reputation: 1108537
Use an HTML parser to get that element, then use a regex to strip everything up to and including "of" and everything from "matches" onward. Here's an SSCCE using HtmlUnit:
package com.stackoverflow.q2615727;

import java.text.NumberFormat;
import java.util.Locale;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class Test {

    public static void main(String... args) throws Exception {
        WebClient client = new WebClient();
        HtmlPage page = client.getPage("http://www.google.com/search?q=html+parser");
        HtmlElement results = page.getElementById("resultStats"); // <div id="resultStats">
        String text = results.asText(); // Results 1 - 10 of about 2,050,000 for html parser. (0.18 seconds)
        String total = text.replaceAll("^(.*about)|(for.*)$", "").trim(); // 2,050,000
        Long l = (Long) NumberFormat.getInstance(Locale.ENGLISH).parse(total); // 2050000
        System.out.println(l);
    }

}
In your specific case, you only need to change the URL and use the following two lines instead:
HtmlElement results = page.getElementById("sr-num-results"); // <div id="sr-num-results">
and
String total = text.replaceAll("^(.*of)|(matches.*)$", "").trim(); // 1,439
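The text-cleaning step can be checked on its own, without HtmlUnit or a network connection. The following is a minimal sketch that assumes the element text is exactly the "Showing..." line from the question; the class name TotalFromText and the method parseTotal are hypothetical.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

// Hypothetical helper class, not part of HtmlUnit.
class TotalFromText {
    // Strips everything up to "of" and from "matches" onward, then parses what is left.
    static long parseTotal(String text) {
        String total = text.replaceAll("^(.*of)|(matches.*)$", "").trim();
        try {
            // parse() returns a Number (Long or Double), so longValue() avoids a cast.
            return NumberFormat.getInstance(Locale.ENGLISH).parse(total).longValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("No number found in: " + text, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseTotal("Showing 1 - 30 of 1,439 matches,")); // prints 1439
    }
}
```

Note that "." does not match line breaks by default, so this assumes the text sits on a single line.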
Upvotes: 1
Reputation: 10444
Using a regular expression to parse the text is one possibility. Sometimes the text you need sits in a specific div in the DOM hierarchy, so you can use an XPath expression to find it. Sometimes you want to look for divs of a specific class. It depends on the HTML in question. In addition to regular expressions, a good HTML parser will come in handy. I've used Jericho HTML, but there are many others out there.
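As a rough illustration of the XPath idea, here is a sketch that uses only the JDK's built-in XML and XPath classes rather than Jericho. It assumes the markup is well-formed XML, which real pages often are not, so the fragment from the question has been closed by hand here; a lenient HTML parser would be needed for actual scraped pages. The class name XPathLookup and the method textOfResultsDiv are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class; works only on well-formed markup.
class XPathLookup {
    // Returns the trimmed text of the div with class "sr-h-o-r" inside the div with id "sr-num-results".
    static String textOfResultsDiv(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // evaluate(String, Object) returns the string value of the first matching node.
            return xpath.evaluate(
                    "//div[@id='sr-num-results']/div[@class='sr-h-o-r']", doc).trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the markup", e);
        }
    }

    public static void main(String[] args) {
        // Assumed, hand-closed version of the fragment from the question.
        String xhtml = "<div id=\"sr-h-right\" class=\"sr-summary\">"
                + "<div id=\"sr-num-results\">"
                + "<div class=\"sr-h-o-r\">Showing 1 - 30 of 1,439 matches,</div>"
                + "</div></div>";
        System.out.println(textOfResultsDiv(xhtml));
    }
}
```

Once the element text is isolated this way, a short regular expression or a number parser can pull out the count.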
Upvotes: 1