N Klosterman

Reputation: 1251

Limit sed "print section of file between two regexps" to the first occurrence

I am parsing text weather data from http://www.nws.noaa.gov/view/prodsByState.php?state=OH&prodtype=hourly and only want to grab the data for my county/area. The catch is that each text report also contains earlier reports from the same day, and I'm only interested in the latest one, which appears towards the beginning of the file. I tried the "print section of file between two regular expressions (inclusive)" entry from the sed one-liners, but I couldn't figure out how to make it stop after the first occurrence.

sed -n '/OHZ061/,/OHZ062/p' /tmp/weather.html
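To make the problem concrete, here is a minimal reproduction. The sample file is made up (it is not real NWS output) and range_all is just a hypothetical helper name; it only shows that the range matches again for the older report further down:

```shell
# Hypothetical sample: two reports for the same zone, newest first.
sample=$(mktemp)
cat > "$sample" <<'EOF'
header
OHZ061-new
DAYTON 41
OHZ062-new
filler
OHZ061-old
DAYTON 35
OHZ062-old
EOF

# A sed range re-arms after its end pattern matches, so every
# OHZ061..OHZ062 block in the file is printed, not just the first.
range_all() { sed -n '/OHZ061/,/OHZ062/p' "$1"; }

range_all "$sample"
```

With this sample, both the "new" and the "old" block come out, while "header" and "filler" are skipped.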

I then found "Sed print between patterns the first match result", which works with the following:

sed -n '/OHZ061/,$p;/OHZ062/q' /tmp/weather.html

but I have a gut feeling, with nothing concrete to back it up, that there might be a more robust solution.

So, are there any better solutions out there? Also, is it possible to get my first attempt to work? If you post a solution, please explain all the switches/backreferences/magic, as I'm still discovering the power of sed and the command-line tools.

And to help start you off:

wget -q "http://www.nws.noaa.gov/view/prodsByState.php?state=OH&prodtype=hourly" -O /tmp/weather.html

PS: I looked at this post: http://www.unix.com/shell-programming-scripting/167069-solved-sed-awk-print-between-patterns-first-occurrence.html but the sed there was completely Greek to me, and I couldn't muddle through it well enough to adapt it to my problem.

Upvotes: 1

Views: 1120

Answers (2)

Ed Morton

Reputation: 204228

sed is an excellent tool for simple substitutions on a single line. For anything else, just use awk:

awk '/OHZ061/{found=1} found{print; if(/OHZ062/) exit}' /tmp/weather.html
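As a rough sketch of how that one-liner behaves, here it is spelled out with comments and run against a made-up sample file (not real NWS output). The helper names first_block_awk and first_block_sed are hypothetical, and the sed form is only one possible way to make the range from the question quit after its first end match, included for comparison:

```shell
# Hypothetical sample: two reports for the same zone, newest first.
demo=$(mktemp)
cat > "$demo" <<'EOF'
header
OHZ061-new
DAYTON 41
OHZ062-new
filler
OHZ061-old
DAYTON 35
OHZ062-old
EOF

# Same logic as the awk one-liner, one step per line.
first_block_awk() {
  awk '
    /OHZ061/ { found = 1 }   # start marker seen: set the flag
    found {
      print                  # print each line while the flag is set
      if (/OHZ062/) exit     # end marker printed: stop reading
    }
  ' "$1"
}

# Range form from the question, with a quit on the end marker.
# -n suppresses automatic printing; p prints lines inside the range;
# q exits sed as soon as the end pattern has been printed.
first_block_sed() { sed -n '/OHZ061/,/OHZ062/{p;/OHZ062/q;}' "$1"; }

first_block_awk "$demo"
first_block_sed "$demo"
```

On this sample, each call prints only the three lines of the newest block. Because both commands stop at the first OHZ062 line, the rest of the file is never read.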

Upvotes: 1

Birei

Reputation: 36282

This is not a sed answer, because I don't like parsing HTML with that tool. Instead, here is a solution using perl with the help of an HTML parser, HTML::TreeBuilder. The code is commented step by step, so it should be easy to follow.

Content of script.pl:

#!/usr/bin/env perl

use warnings;
use strict;
use HTML::TreeBuilder;

##
## Get content of the web page.
##
open my $fh, '-|', 'wget -q -O- "http://www.nws.noaa.gov/view/prodsByState.php?state=OH&prodtype=hourly"' or die;

##
## Parse content into a tree structure.
##
my $tree = HTML::TreeBuilder->new;
$tree->parse_file( $fh ) || die;

## 
## Content is inside <pre>...</pre>, so search it in scalar context to get only
## the first one (the newest).
##
my $weather_data = $tree->find_by_tag_name( 'pre' )->as_text or die;

##
## Split data on "$$" and discard all tables of weather info but the first one.
##
my $last_weather_data = (split /(?m)^\$\$/, $weather_data, 2)[0];

## 
## Remove all data before the pattern "OHZ + three digits" found in the text.
##
$last_weather_data =~ s/\A.*(OHZ\d{3}.*)\z/$1/s;

## 
## Print result.
##
printf qq|%s\n|, $last_weather_data;

Run it like:

perl script.pl

At 23:00 on 14 March 2013 it yielded:

OHZ001>008-015>018-024>027-034-035-043-044-142300-
   NORTHWEST OHIO

CITY           SKY/WX    TMP DP  RH WIND       PRES   REMARKS
DEFIANCE       MOSUNNY   41  18  39 W7G17     30.17F
FINDLAY        SUNNY     39  21  48 W13       30.17F
TOLEDO EXPRESS SUNNY     41  19  41 W14       30.16F
TOLEDO METCALF MOSUNNY   42  21  43 W9        30.17S
LIMA           MOSUNNY   38  22  52 W12       30.18S

Upvotes: 1
