Reputation: 17057
I have a file with a line like this:
<div class="cell contentCell bbActiveRow" tabindex="-1" style="width: 150px; left: 77px; display: block;" cellposition="15,2"><div class="cell contentCell bbActiveRow last-child" tabindex="-1" style="width: 150px; left: 697px; display: block;" cellposition="15,6">159</div></div><div class="contentRow bb_row" rowindex="16" style="display: block; top: 429px;"><div class="cell first-child " title="Go to box" tabindex="-1" role="linkAction" cellposition="16,0"><span class="pre-child" style="background-color:#16A765;"> </span><span class="link" role="link"> </span></div>
The important bit I want to catch is the 159 in:
,6">159</div>
I can catch it fine with grep:
cat c |grep ',6\">[0-9]\+<'
What I actually want is to capture the number itself (159) and print it. Note that the real file has several such lines. Ideally, only the numbers are printed.
I thought I could do it with awk:
cat c | awk ' /,6\">([0-9]\+)/ { print $1 } '
But nope, nothing gets printed out. Having the regexp ready, and knowing that there are several lines in the file with entries that match the expression (with different numbers), how would you squeeze those numbers out?
Upvotes: 1
Views: 92
Reputation: 437082
A pragmatic approach:
cat c | grep -o ',6\">[0-9]\+<' | awk -F'<|>' '{ print $2 }'
-o causes grep to report only the matching part of each line.
awk -F'<|>' '{ print $2 }' then extracts the token between > and <.
As for why your awk command didn't work:
awk uses extended regular expressions, in which + must NOT be escaped as \+ to be recognized as a quantifier.
awk splits each line into fields by whitespace by default, so $1 simply reports the 1st whitespace-separated token on each matching line, irrespective of the regular expression that caused the match.
The solution at the top even finds multiple matches on a line, but if we assume that there's at most 1, it is relatively straightforward to do it all in awk, if you have GNU awk:
cat c | gawk '{ m=gensub(/^.*,6\">([0-9]+)<.*$/, "\\1", "1"); if (m != $0) print m }'
gensub() replaces regex matches and returns the modified string, while crucially also supporting backreferences, which the POSIX sub() and gsub() functions do not.
The command replaces the entire input line with what the capture group matched (\1), and stores the result in a variable. If the variable doesn't equal the input line, a match was captured, and it is printed.
While a solution with POSIX awk features only is possible (using match(), RSTART, RLENGTH, split()), it would be cumbersome.
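For reference, here is a rough sketch of what a POSIX-only variant might look like. It uses only match(), RSTART, RLENGTH and substr() (no split()), and it loops so that several matches on one line are all reported. The function name extract_numbers and the sample input are made up for illustration; the offsets 4 and 5 assume the fixed prefix ,6"> and the trailing < from the regex in the question.

```shell
# Hypothetical helper: reads HTML-ish text on stdin and prints every number
# that appears as ,6">NUMBER< (possibly several per line). POSIX awk only.
extract_numbers() {
  awk '{
    s = $0
    # match() sets RSTART and RLENGTH for the leftmost match in s
    while (match(s, /,6">[0-9]+</)) {
      # skip the 4-character prefix ,6"> and drop the trailing <
      print substr(s, RSTART + 4, RLENGTH - 5)
      # continue searching in the rest of the line
      s = substr(s, RSTART + RLENGTH)
    }
  }'
}

# Made-up sample input, modeled on the line in the question
printf '%s\n' \
  '<div cellposition="15,2"><div cellposition="15,6">159</div></div>' \
  '<div cellposition="16,6">42</div><div cellposition="17,6">7</div>' |
  extract_numbers
# prints 159, 42 and 7, one per line
```

With the file from the question, the call would be extract_numbers < c.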
Finally, if you have xmllint (OS X does, and some Linux distros), consider guido's answer for a solution that performs actual HTML parsing and applies an XPath query, and is therefore more robust.
Upvotes: 2
Reputation: 19194
This oneliner is an alternate way to do that (using an xpath expression which matches div elements containing a cellposition attribute value ending with ',6'):
# xmllint --html test.html --xpath '//div[substring(@cellposition, string-length(@cellposition) - 1)=",6"]/text()'
159
Upvotes: 3