tgai

Reputation: 1115

Best way to parse HTML table

I am interested in parsing the following table and others like it: http://www.cityofames.org/ftp/routes/Fall/wdreds&w.html

Any suggestions on the best tool for the job? After searching around I can't decide what I should use and would like to get some feedback before committing to something.

I am open to any languages/tools.

Upvotes: 0

Views: 483

Answers (3)

Real-world HTML is often not well-formed, so a strict XML parser cannot read it directly. First convert the page to a reasonably close well-formed XML format such as XHTML (well-formed meaning that every opening tag has a matching closing tag) using a program like tidy (http://tidy.sourceforge.net/). You can then parse the result with an XML/XHTML parser. Note that this table marks its data with font styles, so you will have to process the font-style tags and collect their contents into an array of times.

Here is roughly what to do while parsing:

start TR element
  -- Create array
  start b element
    -- Add first time
  end b element
  start b element
    -- Add second time
  end b element
end TR element
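
The event sequence above can be sketched in Python. This is only a minimal sketch under a few assumptions: it swaps in the standard library's html.parser (which tolerates tag soup) for the tidy plus XML parser route described in this answer, it assumes the times sit inside b elements within tr rows as in the outline above, and the SAMPLE markup and the names BoldTimeCollector and parse_bold_times are made up for illustration rather than taken from the real page.

```python
from html.parser import HTMLParser


class BoldTimeCollector(HTMLParser):
    """Collect the text of <b> elements, grouped by their enclosing <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of times per table row
        self._row = None        # row being built, or None outside a <tr>
        self._in_bold = False   # True while inside a <b> element
        self._text = []         # text fragments of the current <b>

    def _flush_row(self):
        # Store the row in progress, if any, and reset the state.
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None
        self._in_bold = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush_row()   # tolerate a missing </tr> on the previous row
            self._row = []      # "start TR element: create array"
        elif tag == "b" and self._row is not None:
            self._in_bold = True
            self._text = []

    def handle_data(self, data):
        if self._in_bold:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "b" and self._in_bold:
            value = "".join(self._text).strip()
            if value:
                self._row.append(value)   # "add one time"
            self._in_bold = False
        elif tag == "tr":
            self._flush_row()             # "end TR element"


def parse_bold_times(html):
    """Return a list of rows, each a list of the bold cell texts."""
    parser = BoldTimeCollector()
    parser.feed(html)
    parser.close()
    parser._flush_row()   # keep a final row that never saw </tr>
    return parser.rows


# Hypothetical markup, not copied from the real timetable page.
SAMPLE = """
<table>
<tr><td><b>6:25</b></td><td><b>6:31</b></td></tr>
<tr><td><b>7:02</b></td><td>7:08</td></tr>
</table>
"""

print(parse_bold_times(SAMPLE))
```

If the real page uses different tags for its styled times, the tag names checked in handle_starttag and handle_endtag would need to change accordingly.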

Upvotes: 1

fifo

Reputation: 11

With lynx I can do:

$ lynx -dump http://www.cityofames.org/ftp/routes/Fall/wdreds\&w.html
    6:25  6:31  6:36  6:41 -----  6:46  6:50      6:56
    7:02  7:08  7:14  7:20 -----  7:26  7:30      7:36
   ----- ----- ----- -----  7:38  7:43  7:47      7:53 1A
    7:28  7:35  7:42  7:48 -----  7:56  8:00      8:06
   ----- ----- ----- -----  7:58  8:03  8:07      8:13 1A
...

This output is very easy to parse with the scripting language of your choice. html2text may also work, though I have never used it.

You could also play around with grep/sed to format it.
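
As a rough illustration of the scripting step, here is a minimal Python sketch that splits lines of the dump shown above into times and trailing notes. It makes some assumptions: a run of dashes means "no stop" and is stored as None, anything that is neither a time nor dashes (such as 1A) is treated as a note, and splitting on whitespace is good enough, which means empty columns are not preserved. The function name parse_dump_line is hypothetical.

```python
import re

# A clock time such as 6:25 or 12:05.
TIME = re.compile(r"\d{1,2}:\d{2}")


def parse_dump_line(line):
    """Split one line of the text dump into (times, notes).

    Dashed placeholders become None so that the position of each
    time within the row is kept.
    """
    times, notes = [], []
    for token in line.split():
        if TIME.fullmatch(token):
            times.append(token)
        elif set(token) == {"-"}:
            times.append(None)      # assumed to mean "no stop here"
        else:
            notes.append(token)     # e.g. a trailing marker such as 1A
    return times, notes


# Two lines copied from the dump above.
SAMPLE = """\
    6:25  6:31  6:36  6:41 -----  6:46  6:50      6:56
   ----- ----- ----- -----  7:38  7:43  7:47      7:53 1A
"""

for line in SAMPLE.splitlines():
    if line.strip():
        print(parse_dump_line(line))
```

The text could be fed in from a file or from the output of the lynx command rather than from a string literal.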

Upvotes: 1

Umer Hayat

Reputation: 2001

If you are looking for an HTML parser, there are a number of options in Java:

You might also want to go through a very comprehensive discussion of the pros and cons of using each of these here.

Upvotes: 1
