Reputation: 1115
I am interested in parsing the following table and others like it: http://www.cityofames.org/ftp/routes/Fall/wdreds&w.html
Any suggestions on the best tool for the job? After searching around I can't decide what I should use and would like to get some feedback before committing to something.
I am open to any languages/tools.
Upvotes: 0
Views: 483
Reputation: 18642
Real-world HTML is often not well-formed, so a strict XML parser cannot read it directly. First convert the page to a reasonably close well-formed XML format (one in which every tag is matched), such as XHTML, using a program like tidy (http://tidy.sourceforge.net/). You can then parse the result with an XML/XHTML parser. Note that the data in this table is marked up with font-style tags, so you will have to walk those tags and collect their contents into an array of times for each row.
While parsing, handle the events roughly like this:
start TR element
--Create Array
start b element
-- Add One time
end b element
start b element
-- Add second time
end b element
end TR element
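The event sequence above can be sketched in Python. This is only an illustration under assumptions: it assumes each time sits inside a `<b>` element within a `<tr>` row, as the outline suggests, and it swaps in the standard library's tolerant `html.parser` instead of the tidy plus XML parser route, so no cleanup pass is needed. The names `RowTimeParser` and `parse_rows` are made up for this example.

```python
from html.parser import HTMLParser


class RowTimeParser(HTMLParser):
    """Collect the text of every <b> element, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of strings
        self._row = None    # row being built, or None when outside <tr>
        self._in_b = False  # True while inside a <b> element
        self._buf = []      # text fragments of the current <b>

    def _flush_row(self):
        # Store the current row (if any) and reset the row state.
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None
        self._in_b = False

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case.
        if tag == "tr":
            self._flush_row()   # copes with a missing </tr>
            self._row = []      # "start TR element: create array"
        elif tag == "b" and self._row is not None:
            self._in_b = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "b" and self._in_b:
            text = "".join(self._buf).strip()
            if text:
                self._row.append(text)  # "add one time"
            self._in_b = False
        elif tag == "tr":
            self._flush_row()           # "end TR element"

    def handle_data(self, data):
        if self._in_b:
            self._buf.append(data)


def parse_rows(html):
    """Return a list of rows, each a list of the bold strings it contains."""
    parser = RowTimeParser()
    parser.feed(html)
    parser.close()
    parser._flush_row()  # keep a final row that was never closed
    return parser.rows


# Hypothetical markup for illustration; the real page may differ.
SAMPLE = (
    "<table><tr><td><b>6:25</b></td><td><b>6:31</b></td></tr>"
    "<tr><td><b>7:02</b></td></tr></table>"
)
```

Calling `parse_rows(SAMPLE)` yields one list per row. If the real page marks times with some other tag, the `"b"` checks would need to change to match.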
Upvotes: 1
Reputation: 11
With lynx I can do:
$ lynx -dump http://www.cityofames.org/ftp/routes/Fall/wdreds\&w.html
6:25 6:31 6:36 6:41 ----- 6:46 6:50 6:56
7:02 7:08 7:14 7:20 ----- 7:26 7:30 7:36
----- ----- ----- ----- 7:38 7:43 7:47 7:53 1A
7:28 7:35 7:42 7:48 ----- 7:56 8:00 8:06
----- ----- ----- ----- 7:58 8:03 8:07 8:13 1A
...
This output becomes very easy to parse with the scripting language of your choice. html2text
may also work (I have never used it).
You could also play around with grep/sed to format it.
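As a rough sketch of that parsing step in Python, assuming the dump looks like the excerpt above: tokens are separated by whitespace, a run of dashes means "no stop", and anything else on a row (such as `1A`) is a note. The function names are hypothetical, and lines without any time (headers, blank lines) are simply skipped.

```python
import re

TIME_RE = re.compile(r"^\d{1,2}:\d{2}$")  # e.g. 6:25 or 12:05
GAP_RE = re.compile(r"^-+$")              # e.g. -----


def parse_dump_line(line):
    """Split one line into (times, notes); a dash run becomes None."""
    times, notes = [], []
    for token in line.split():
        if TIME_RE.match(token):
            times.append(token)
        elif GAP_RE.match(token):
            times.append(None)
        else:
            notes.append(token)
    return times, notes


def parse_dump(text):
    """Parse the whole dump, keeping only lines with at least one time."""
    rows = []
    for line in text.splitlines():
        times, notes = parse_dump_line(line)
        if any(t is not None for t in times):
            rows.append((times, notes))
    return rows


# Two rows copied from the lynx output shown above.
EXCERPT = """\
6:25 6:31 6:36 6:41 ----- 6:46 6:50 6:56
----- ----- ----- ----- 7:38 7:43 7:47 7:53 1A
...
"""
```

You could feed it the real output with something like `lynx -dump URL > dump.txt` and then `parse_dump(open("dump.txt").read())`.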
Upvotes: 1