Reputation: 2949
I want to extract the text from the table at http://www.amiriconstruction.co.uk/goodwoodgolf/scoretable.htm into a plain-text file, without HTML tags, from the Mac OS X command line.
I tried a lot of sed commands, but sed just prints the whole file again. What am I doing wrong?
Example of what I tried:
sed -n '/<tr>/,/<\/tr>/p' scoretable.htm
(This just prints the table contents with the HTML tags still in place :( )
Upvotes: 0
Views: 3300
Reputation: 9118
sed -n 's;</\?td>;;gp' scoretable.htm | \
  sed -e 's;<td class="center">;;' \
      -e 's;<.*>;;'
Note that I use ; instead of / as the delimiter because I find it a bit easier to read. sed treats whatever character follows the s as the delimiter.
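As a quick illustration of the delimiter choice, here is a small sketch on a made-up path string (the input is an assumption for demonstration, not something taken from the page being scraped). With ; as the delimiter, the slashes in the pattern need no escaping:

```shell
# Hypothetical input: a path containing slashes.
path='/usr/local/bin'

# With / as the delimiter, every slash in the pattern must be escaped.
with_slash=$(printf '%s\n' "$path" | sed 's/\/usr\/local/\/opt/')

# With ; as the delimiter, the same substitution needs no escaping.
with_semicolon=$(printf '%s\n' "$path" | sed 's;/usr/local;/opt;')

echo "$with_slash"
echo "$with_semicolon"
```

Both commands perform the same substitution; only the readability of the expression differs.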
Okay, now the explanation. The first line:
-n suppresses sed's default output, but the p flag at the end of the command tells sed to print every line on which the substitution succeeded. This gets us only the lines that contain a <td> or </td> tag. At the same time, anything that matches </\?td> is replaced with nothing. /\? means the / may appear zero times or once, so the pattern matches both the opening and the closing tag (\? is a GNU extension to basic regular expressions, so not every sed supports it). The g flag at the end, for global, means sed keeps looking for matches after the first one succeeds on a line. Without g it would only remove the opening tag.
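To see the effect of -n, p and g in isolation, here is a minimal sketch on a few made-up rows shaped like the table markup (the sample rows are an assumption). It uses -E with /? instead of \? so that it does not depend on the GNU extension; that spelling is a substitution on my part, not the command from the answer:

```shell
# Hypothetical sample rows, shaped like the table markup in the question.
sample='<tr>
<td>John Durrant</td>
<td class="center">40</td>
</tr>'

# -n suppresses default output; the p flag prints only lines where the
# substitution succeeded; g removes every match on the line.
# -E with /? stands in for the GNU-only \? spelling.
with_g=$(printf '%s\n' "$sample" | sed -n -E 's;</?td>;;gp')

# Without g, only the first match on the line (the opening tag) is removed.
without_g=$(printf '%s\n' '<td>John Durrant</td>' | sed -n -E 's;</?td>;;p')

printf '%s\n' "$with_g"
printf '%s\n' "$without_g"
```

The <tr> and </tr> lines are dropped because nothing was substituted on them. The line with class="center" keeps its opening tag at this stage, since </?td> only matches a bare <td> or </td>.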
The output from this is piped into sed again on the second line:
-e just specifies that an editing command follows. With a single command it is implied, but here I run two (the second one is on the third line).
The first command removes <td class="center">, and the second removes any other tags (in this case the <br> tags).
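A small sketch of the two -e commands working together, on made-up lines that resemble what the first sed might hand over (the sample lines are an assumption):

```shell
# Hypothetical lines, as they might look after the first sed has run.
lines='<td class="center">40
Oct 2011<br />'

# Each -e adds one editing command; they run in order on every line.
cleaned=$(printf '%s\n' "$lines" | sed -e 's;<td class="center">;;' -e 's;<.*>;;')

printf '%s\n' "$cleaned"
```

On the first line only the first command finds something to remove; on the second line only the second one does.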
The last command, s;<.*>;;, is only safe if you're sure there is at most one tag left on a line. Otherwise the .* will be greedy and match too much, so on a line such as:
<td class="center">24 <br />
it would match the entire line and remove everything.
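The greediness problem can be reproduced on a single made-up line (an assumed example). The second command below is one common way around it, offered as an alternative rather than as part of the answer's pipeline: [^>]* cannot run past a >, so each tag is matched separately.

```shell
# Hypothetical line carrying two tags.
line='<td class="center">24 <br />'

# Greedy: <.*> spans from the first < to the last >, so nothing is left.
greedy=$(printf '%s\n' "$line" | sed 's;<.*>;;')

# Alternative (not from the answer): [^>]* cannot cross a >, so each tag
# is matched on its own and g removes them all.
per_tag=$(printf '%s\n' "$line" | sed 's;<[^>]*>;;g')

# Brackets make the empty result and the trailing space visible.
printf '[%s]\n' "$greedy"
printf '[%s]\n' "$per_tag"
```

The per-tag pattern keeps the cell text (with its trailing space) while the greedy one leaves an empty line.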
Upvotes: 2
Reputation: 58578
A little TXR web scraping, with the help of wget to grab the page:
@(deffilter nobr ("<br />" ""))
@(deffilter brsp ("<br />" " "))
@(deffilter nosp (" " ""))
@(next "!wget 2>/dev/null -O - http://www.amiriconstruction.co.uk/goodwoodgolf/scoretable.htm")
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
@(skip)
<div class="scoreTableArea">
@(collect)
<h2 class="unify">@year - @event</h2>
@ (filter brsp event)
@ (collect)
<tr>
<td class="center">@pos</td>
<td>@player</td>
<td>@company</td>
<td>@date</td>
<td class="center">@points</td>
</tr>
@ (filter nobr player company date points)
@ (filter nosp pos points)
@ (until)
</tbody>
@ (end)
@(end)
@(output :filter :from_html)
@ (repeat)
Event: @event
Year: @year
DATE POS PT PLAYER COMPANY
@ (repeat)
@{date -10} @{pos -2} @{points 2} @{player 16} @company
@ (end)
@ (end)
@(end)
Sample run:
$ txr scoretable.txr
Event: Teeing off to Clobber Ken
Year: 2011
DATE POS PT PLAYER COMPANY
Sept 2011 1 40 John Durrant King Sumners Partnership
Sept 2011 2 34 Grahame Pettit Amiri Construction
Oct 2011 3 31 Tony Deacon Gleeds
Oct 2011 4 29 Tony Boyle Lacey Hickey Caley
Oct 2011 5 29 Richard Hemming Scott White and Hookins
Sept 2011 6 29 Ian McCoy Selway Joyce
June 2011 7 27 Julian Larkin C&G Properties
Sept 2011 8 25 Roque Menezes Capita Symonds
June 2011 9 22 Shawn Lambert PWP Architects
Sept 2011 10 22 Kevin Lendon Amiri Construction
Event: Ken Watson (HNW Architects) Undisputed Amiri Golf Demon of the Downs
Year: 2010
DATE POS PT PLAYER COMPANY
2010 1 40 Ken Watson HNW Architects
2010 2 37 David Heda London Clancy
2010 3 34 Gordon Brown Currie & Brown
2010 4 32 Alistair Taylor Wildbrook Properties
5 30 Andy Goodridge City Estates
6 25 Russ Pitman Henderson Green
7 24 Phil Piper Piper Whitlock
8 23 Kevin Miller Urban Pulse Architects
9 19 Simon Asquith Godsall Arnold Partnership
10 19 Shawn Lambert PWP Architects
11 18 Martin Judd Davis Langdon
Upvotes: 3