Reputation: 489
I need to pull data from an HTML page using Java code. The Java part is required.
The page I am trying to pull info from is http://www.weather.gov/data/obhistory/KMCI.html .
I need to create a list of HashMaps, or some other kind of data object that I can reference in later code.
This is all I have so far:
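import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

URL weatherDataKC = new URL("http://www.weather.gov/data/obhistory/KMCI.html");
StringBuilder buffer = new StringBuilder();
// Decode through a Reader instead of casting raw bytes to char (UTF-8 is assumed here);
// try-with-resources closes the stream even if reading fails.
try (Reader in = new InputStreamReader(weatherDataKC.openStream(), StandardCharsets.UTF_8)) {
    int c;
    while ((c = in.read()) != -1) {
        buffer.append((char) c);
    }
}
System.out.print(buffer.toString());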
Any suggestions where to start?
Upvotes: 1
Views: 4818
Reputation: 2667
J2SE includes HTML parsing capabilities in the packages javax.swing.text.html and javax.swing.text.html.parser. HTMLEditorKit.ParserCallback receives events pushed by DocumentParser (which is best used through ParserDelegator). The framework is very similar to the SAX parsers for XML.
Beware, there are some bugs, and it does not handle malformed HTML very well.
Handling colspan and rowspan is left to you.
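As a rough illustration of the callback approach, here is a minimal sketch that collects the text of every table cell, grouped by row. The names TableRowScraper, RowCollector and parseRows are made up for this example. It assumes flat (non-nested) tables and makes no attempt to expand colspan or rowspan.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the JDK.
final class TableRowScraper {

    /** Receives SAX-like events and collects TD/TH text, one list per TR. */
    static final class RowCollector extends HTMLEditorKit.ParserCallback {
        final List<List<String>> rows = new ArrayList<>();
        private List<String> currentRow;   // null while outside a row
        private StringBuilder currentCell; // null while outside a cell

        private static boolean isCell(HTML.Tag tag) {
            return tag == HTML.Tag.TD || tag == HTML.Tag.TH;
        }

        // Stores the cell that is currently open, if any.
        private void closeCell() {
            if (currentCell != null && currentRow != null) {
                currentRow.add(currentCell.toString().trim());
            }
            currentCell = null;
        }

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (tag == HTML.Tag.TR) {
                currentRow = new ArrayList<>();
            } else if (isCell(tag) && currentRow != null) {
                closeCell();
                currentCell = new StringBuilder();
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            // May be called several times per cell when the cell contains inline tags.
            if (currentCell != null) {
                currentCell.append(data);
            }
        }

        @Override
        public void handleEndTag(HTML.Tag tag, int pos) {
            if (isCell(tag)) {
                closeCell();
            } else if (tag == HTML.Tag.TR && currentRow != null) {
                closeCell();
                rows.add(currentRow);
                currentRow = null;
            }
        }
    }

    /** Parses HTML from any Reader, e.g. one wrapped around URL.openStream(). */
    static List<List<String>> parseRows(Reader html) throws IOException {
        RowCollector collector = new RowCollector();
        // true = ignore charset directives found inside the document
        new ParserDelegator().parse(html, collector, true);
        return collector.rows;
    }

    /** Convenience overload for HTML that is already in memory. */
    static List<List<String>> parseRows(String html) {
        try {
            return parseRows(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling TableRowScraper.parseRows(buffer.toString()) on the downloaded page would return cells from every table on the page, so the rows belonging to the observation table still have to be picked out afterwards. Turning each row into a HashMap keyed by column name would be a further step on top of this.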
Upvotes: 1
Reputation: 72510
HTML scraping is notoriously difficult, unless you have a lot of "hooks" like unique IDs. For example, the table you want starts with this HTML:
<table cellspacing="3" cellpadding="2" border="0" width="670">
...which is very generic and may match several tables on the page. The other problem is, what happens if the HTML structure changes? You'll have to redefine all your parsing rules...
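One way to check how ambiguous such a hook is would be to count the tables whose attributes match it before relying on it. The sketch below does that with the JDK's Swing HTML parser; the class name TableHookCheck and the method countTables are invented for this example, and it assumes the parser reports attribute values as plain strings.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the JDK.
final class TableHookCheck {

    /** Counts <table> start tags whose width and cellspacing equal the given values. */
    static int countTables(String html, String width, String cellspacing) {
        final int[] count = {0};
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TABLE
                        && width.equals(attrs.getAttribute(HTML.Attribute.WIDTH))
                        && cellspacing.equals(attrs.getAttribute(HTML.Attribute.CELLSPACING))) {
                    count[0]++;
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count[0];
    }
}
```

If countTables(page, "670", "3") returns anything other than 1, the attributes alone do not identify the table, and some additional signal (position on the page, header text) is needed.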
Upvotes: 0
Reputation: 116334
There is a nice HTML parser called NekoHTML:
NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags.
Upvotes: 3