aintnoprophet

Reputation: 489

Read an HTML table into Java

I need to pull data from an HTML page using Java code. The Java part is required.

The page I am trying to pull info from is http://www.weather.gov/data/obhistory/KMCI.html .

I need to create a list of HashMaps, or some other kind of data object that I can reference in later code.

This is all I have so far:

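import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

URL weatherDataKC = new URL("http://www.weather.gov/data/obhistory/KMCI.html");
StringBuilder buffer = new StringBuilder();

// Read the page as text (UTF-8 is assumed here); try-with-resources closes the stream.
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(weatherDataKC.openStream(), StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        buffer.append(line).append('\n');
    }
}

System.out.print(buffer);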

Any suggestions where to start?

Upvotes: 1

Views: 4818

Answers (4)

Marian

Reputation: 2667

J2SE includes HTML parsing capabilities in the packages javax.swing.text.html and javax.swing.text.html.parser. An HTMLEditorKit.ParserCallback receives events pushed by DocumentParser (which is best used through ParserDelegator). The framework is very similar to the SAX parsers for XML.

Beware, there are some bugs, and it does not handle malformed HTML very well.

Handling colspan and rowspan is up to you.
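
As a rough illustration of the callback approach described above, here is a minimal sketch that collects the text of every table cell into a list of rows. The class name TableCellCollector and its parse helper are made up for this example; only the javax.swing.text.html classes come from the JDK. It assumes simple, non-nested tables with explicit closing tags and ignores colspan and rowspan entirely.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: gathers cell text row by row from parser events.
public class TableCellCollector extends HTMLEditorKit.ParserCallback {
    private final List<List<String>> rows = new ArrayList<>();
    private List<String> currentRow;    // non-null while inside a <tr>
    private StringBuilder currentCell;  // non-null while inside a <td> or <th>

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.TR) {
            currentRow = new ArrayList<>();
        } else if (tag == HTML.Tag.TD || tag == HTML.Tag.TH) {
            currentCell = new StringBuilder();
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Text outside a cell is ignored.
        if (currentCell != null) {
            currentCell.append(data);
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if ((tag == HTML.Tag.TD || tag == HTML.Tag.TH) && currentCell != null) {
            if (currentRow != null) {
                currentRow.add(currentCell.toString().trim());
            }
            currentCell = null;
        } else if (tag == HTML.Tag.TR && currentRow != null) {
            rows.add(currentRow);
            currentRow = null;
        }
    }

    // Parses the HTML from the reader and returns one list of cell strings per row.
    public static List<List<String>> parse(Reader html) throws IOException {
        TableCellCollector collector = new TableCellCollector();
        new ParserDelegator().parse(html, collector, true);
        return collector.rows;
    }
}
```

Calling TableCellCollector.parse with a Reader over the downloaded page should give a list of rows that can then be turned into maps keyed by the header cells. Since this sketch collects cells from every table on the page, the rows belonging to the wanted table would still need to be picked out.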

Upvotes: 1

DisgruntledGoat

Reputation: 72510

HTML scraping is notoriously difficult, unless you have a lot of "hooks" like unique IDs. For example, the table you want starts with this HTML:

<table cellspacing="3" cellpadding="2" border="0" width="670">

...which is very generic and may match several tables on the page. The other problem is, what happens if the HTML structure changes? You'll have to redefine all your parsing rules...
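
One way to see how generic that signature is would be to count how many tables on the page carry the same attributes. The sketch below does this with the JDK's built-in Swing HTML parser; the class name TableSignatureCounter and the choice to compare only width and cellspacing are assumptions made for this example. If the count is greater than one, the attributes alone cannot identify the table.

```java
import java.io.IOException;
import java.io.Reader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: counts <table> tags with width="670" and cellspacing="3".
public class TableSignatureCounter extends HTMLEditorKit.ParserCallback {
    private int matches;

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag != HTML.Tag.TABLE) {
            return;
        }
        // Missing attributes come back as null, which String.valueOf turns into "null".
        String width = String.valueOf(attrs.getAttribute(HTML.Attribute.WIDTH));
        String spacing = String.valueOf(attrs.getAttribute(HTML.Attribute.CELLSPACING));
        if ("670".equals(width) && "3".equals(spacing)) {
            matches++;
        }
    }

    // Parses the HTML from the reader and returns the number of matching tables.
    public static int countMatches(Reader html) throws IOException {
        TableSignatureCounter counter = new TableSignatureCounter();
        new ParserDelegator().parse(html, counter, true);
        return counter.matches;
    }
}
```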

Upvotes: 0

dfa

Reputation: 116334

There is a nice HTML parser called NekoHTML:

NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags.

More information here.

Upvotes: 3

Damo

Reputation: 11540

Use an HTML parser like CyberNeko.

Upvotes: 2
