Reputation: 6651
<span class="cur_wind">with 3km/h SSW winds</span><hr class="hr_sm" /></td>
I want to extract the words "with 3km/h SSW winds" (note that this string will change, so hardcoding it won't work) from the line above using the 'grep' command. I have been trying for a long time and am completely lost. Any help would be appreciated.
Upvotes: 0
Views: 119
Reputation: 437080
Here's a GNU grep solution that uses -P to activate support for PCREs (Perl-Compatible Regular Expressions):
grep -Po '"cur_wind">\K[^<]+' \
<<<'<span class="cur_wind">with 3km/h SSW winds</span><hr class="hr_sm" /></td>'
-o specifies that only the matching string be output.
\K is a PCRE feature that drops everything matched so far; this allows providing context for more specific matching without including that context in the match.
Another option is to use a look-behind assertion in lieu of \K:
grep -Po '(?<="cur_wind">)[^<]+' \
<<<'<span class="cur_wind">with 3km/h SSW winds</span><hr class="hr_sm" /></td>'
Of course, this kind of matching relies on the specific formatting of the input string (whitespace, single- vs. double-quoting, ordering of attributes, ...), in addition to the fundamental problem of grep not understanding the structure of the data, and is thus fragile.
Thus, in general, as others have noted, grep is the wrong tool for the job.
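To make the fragility concrete, here is a minimal sketch that feeds slightly different but equivalent markup to the \K pattern from above. The helper name extract_wind and the two variant input lines are made up for illustration, and the sketch assumes GNU grep built with -P support.

```shell
# Hypothetical helper wrapping the \K pattern shown earlier (needs GNU grep with -P).
extract_wind() {
  grep -Po '"cur_wind">\K[^<]+'
}

# The exact formatting the pattern was written for: the wind text is printed.
printf '%s\n' '<span class="cur_wind">with 3km/h SSW winds</span>' | extract_wind

# Same markup with a single-quoted attribute: no match, so grep exits with status 1.
printf '%s\n' "<span class='cur_wind'>with 3km/h SSW winds</span>" | extract_wind ||
  echo 'no match: single-quoted attribute'

# An extra attribute after class (made up here): no match either.
printf '%s\n' '<span class="cur_wind" id="w">with 3km/h SSW winds</span>' | extract_wind ||
  echo 'no match: extra attribute'
```

Only the first call prints the wind text; the other two fall through to the echo because grep exits with status 1 when nothing matches.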
On OSX, assuming the input is XML (or XHTML), you can parse robustly with the stock xmllint utility and an XPath expression:
xmllint --xpath '//span[@class="cur_wind"]/text()' - <<<\
'<td><span class="cur_wind">with 3km/h SSW winds</span><hr class="hr_sm" /></td>'
Here's a similar solution using a third-party utility, the multi-platform web-scraping utility xidel (which handles both HTML and XML):
xidel -q -e '//span[@class="cur_wind"]' - <<<\
'<td><span class="cur_wind">with 3km/h SSW winds</span><hr class="hr_sm" /></td>'
Upvotes: 2
Reputation: 37029
Try sed:
echo '<span class="cur_wind">with 3km/h SSW winds</span><hr class="hr_sm" /></td>' | sed -e 's/<[^>]*>//g'
Output
with 3km/h SSW winds
Explanation
echo 'whatever' writes the word whatever to the screen (standard output, aka stdout).
The | symbol is a pipe. The command to the right of it takes the output of echo as its input and does something with it.
sed is a stream editor. Its -e switch tells sed to evaluate a script or expression.
The s/xyz/abc/g format is simple: s/ means substitute and /g means globally, so it substitutes all occurrences of xyz with abc.
s/<[^>]*>//g is where it gets interesting. Let's focus on <[^>]*>. It matches anything that starts with <, continues with any number of characters that are not >, and ends with >; the substitution replaces each such match with nothing.
Take <span class="cur_wind"> for example. That tag starts with <, contains other characters right after it, and then ends with >. sed says: when such text is found, chop it off (replace it with nothing).
The same happens to the <hr> and </td> tags. What remains is the text you want.
This is a somewhat simplified explanation.
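Because the command above removes every tag, any other text on the line survives too. As a rough sketch of a narrower alternative (not part of the original answer), sed can capture just the contents of the cur_wind span. The helper name wind_only and the sample line with the extra "Now: " text are assumptions for illustration.

```shell
# Hypothetical helper: print only the text inside the cur_wind span.
# -n suppresses default output; the p flag prints lines where the substitution succeeded.
wind_only() {
  sed -n 's/.*<span class="cur_wind">\([^<]*\)<\/span>.*/\1/p'
}

# Hypothetical input line with extra text outside the span.
line='<td>Now: <span class="cur_wind">with 3km/h SSW winds</span><hr class="hr_sm" /></td>'

# Stripping every tag keeps the surrounding text: prints "Now: with 3km/h SSW winds"
printf '%s\n' "$line" | sed -e 's/<[^>]*>//g'

# Capturing the span contents drops it: prints "with 3km/h SSW winds"
printf '%s\n' "$line" | wind_only
```

This is still plain text matching, so it depends on the exact spelling of the span tag just like the grep patterns do.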
Upvotes: 1
Reputation: 295291
grep doesn't know XML, and thus is the wrong tool for the job; use a real XML parser. One of the better ones easily accessible from bash is XMLStarlet.
xmlstarlet sel -t -m "//span[@class='cur_wind']/text()" -v . -n <input.xml
This extracts all text directly contained within a span of the class cur_wind.
Upvotes: 1
Reputation: 16728
If that is all you want, then cat | grep ".with 3km/h SSW winds." should do it, but I suspect you need more than that.
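Since the wind string changes, a pattern that does not hardcode it is probably closer to what is needed. A minimal sketch, assuming a grep that supports -o (both GNU and BSD grep do, although -o is not in POSIX) and no -P; the helper name wind_portable and the second wind string are made up for illustration.

```shell
# Hypothetical helper: grep -o keeps the opening tag plus the text up to the next "<",
# then sed removes the leading tag.
wind_portable() {
  grep -o '<span class="cur_wind">[^<]*' | sed 's/^<[^>]*>//'
}

# A different (made-up) wind string, to show that nothing is hardcoded.
printf '%s\n' '<span class="cur_wind">with 10km/h N winds</span><hr class="hr_sm" /></td>' |
  wind_portable
```

The same caveat applies as for any grep-based approach: it only works while the markup keeps this exact shape.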
Upvotes: 0