Reputation: 1251
I am parsing text weather data from http://www.nws.noaa.gov/view/prodsByState.php?state=OH&prodtype=hourly and only want the data for my county/area. The catch is that each text report also contains the earlier reports from the same day, and I'm only interested in the latest one, which appears near the beginning of the file. I tried the "print section of file between two regular expressions (inclusive)" recipe from the sed one-liners, but I couldn't figure out how to make it stop after the first occurrence:
sed -n '/OHZ061/,/OHZ062/p' /tmp/weather.html
I then found "Sed print between patterns the first match result", which works with the following:
sed -n '/OHZ061/,$p;/OHZ062/q' /tmp/weather.html
but it doesn't feel like the most robust solution. I have nothing concrete to back that up, just a gut feeling that something more robust exists.
So are there any better solutions out there? Also, is it possible to get my first attempt to work? If you post a solution, please explain all the switches, backreferences, and other magic, as I'm still discovering the power of sed and the command-line tools.
And to help start you off:
wget -q "http://www.nws.noaa.gov/view/prodsByState.php?state=OH&prodtype=hourly" -O /tmp/weather.html
P.S. I looked at this post: http://www.unix.com/shell-programming-scripting/167069-solved-sed-awk-print-between-patterns-first-occurrence.html but the sed there was completely Greek to me, and I couldn't work out how to adapt it to my problem.
Upvotes: 1
Views: 1120
Reputation: 204228
sed is an excellent tool for simple substitutions on a single line. For anything else, just use awk:
awk '/OHZ061/{found=1} found{print; if(/OHZ062/) exit}' /tmp/weather.html
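Since the question asked for an explanation of each piece, here is a minimal sketch that runs the one-liner above on a small made-up file and annotates every clause. The sample contents and the variable names (`sample`, `awk_out`, `sed_out`) are assumptions for illustration only, not real NWS data. For comparison it also includes one possible way to make the range form from the question stop after the first block, by quitting inside the range; treat that as a sketch rather than the definitive sed answer.

```shell
# Hypothetical sample file: two reports, newest first, same markers in each.
sample=$(mktemp)
cat > "$sample" <<'EOF'
PAGE HEADER
OHZ061 newest report
DAYTON SUNNY 41
OHZ062 newest report
OHZ061 older report
DAYTON CLOUDY 35
OHZ062 older report
EOF

# awk, clause by clause:
#   /OHZ061/{found=1}   set a flag when the start marker is seen
#   found{print; ...}   while the flag is set, print the current line
#   if(/OHZ062/) exit   after printing the end marker, stop reading input
awk_out=$(awk '/OHZ061/{found=1} found{print; if(/OHZ062/) exit}' "$sample")

# sed, for comparison:
#   -n                  print nothing unless told to
#   /OHZ061/,/OHZ062/   address range from start marker to end marker
#   {p;/OHZ062/q;}      inside the range: print, then quit on the end marker
sed_out=$(sed -n '/OHZ061/,/OHZ062/{p;/OHZ062/q;}' "$sample")

printf '%s\n' "$awk_out"
rm -f "$sample"
```

On this sample both commands keep only the three lines of the newest block. Without the `q`, the sed range would match again at the second `OHZ061` and print the older block too, which is the behaviour described in the question.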
Upvotes: 1
Reputation: 36282
Not sed, because I don't like to parse HTML with that tool, but here is a solution using perl with the help of an HTML parser, HTML::TreeBuilder. The code is commented step by step, so I think it's easy to follow.
Content of script.pl:
#!/usr/bin/env perl
use warnings;
use strict;
use HTML::TreeBuilder;
##
## Get content of the web page.
##
open my $fh, '-|', 'wget -q -O- "http://www.nws.noaa.gov/view/prodsByState.php?state=OH&prodtype=hourly"' or die;
##
## Parse content into a tree structure.
##
my $tree = HTML::TreeBuilder->new;
$tree->parse_file( $fh ) || die;
##
## Content is inside <pre>...</pre>, so search it in scalar context to get only
## the first one (the newest).
##
my $weather_data = $tree->find_by_tag_name( 'pre' )->as_text or die;
##
## Split data on "$$" and discard all tables of weather info but the first one.
##
my $last_weather_data = (split /(?m)^\$\$/, $weather_data, 2)[0];
##
## Remove everything before the pattern "OHZ + digits" in the text.
##
$last_weather_data =~ s/\A.*(OHZ\d{3}.*)\z/$1/s;
##
## Print result.
##
printf qq|%s\n|, $last_weather_data;
Run it like:
perl script.pl
At 23:00 on 14 March 2013 it yielded:
OHZ001>008-015>018-024>027-034-035-043-044-142300-
NORTHWEST OHIO
CITY SKY/WX TMP DP RH WIND PRES REMARKS
DEFIANCE MOSUNNY 41 18 39 W7G17 30.17F
FINDLAY SUNNY 39 21 48 W13 30.17F
TOLEDO EXPRESS SUNNY 41 19 41 W14 30.16F
TOLEDO METCALF MOSUNNY 42 21 43 W9 30.17S
LIMA MOSUNNY 38 22 52 W12 30.18S
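If the text inside `<pre>` has already been extracted to a plain file, the last two steps of the script (keep what comes before the first `$$` line, then drop what comes before the "OHZ + digits" header) can be roughly sketched with awk. The sample data and the names `pre_text` and `first_report` are made up for illustration, and this works on whole lines, so it only approximates the Perl regexes rather than reproducing them exactly.

```shell
# Hypothetical sample of the <pre> text: two reports separated by "$$" lines.
pre_text=$(mktemp)
cat > "$pre_text" <<'EOF'
HOURLY ROUNDUP
OHZ001-142300-
TOLEDO SUNNY 41
$$
OHZ001-142200-
TOLEDO SUNNY 43
$$
EOF

# /^\$\$/{exit}                    stop at the first separator, without printing it
# /OHZ[0-9][0-9][0-9]/{found=1}    note the first line holding "OHZ" + 3 digits
# found{print}                     print from that line onwards
first_report=$(awk '/^\$\$/{exit} /OHZ[0-9][0-9][0-9]/{found=1} found{print}' "$pre_text")

printf '%s\n' "$first_report"
rm -f "$pre_text"
```

On this sample it prints the `OHZ001-142300-` header and the `TOLEDO SUNNY 41` line, and nothing from the older report.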
Upvotes: 1