Merc

Reputation: 17057

Find a regexp in awk

I have a file with a line like this:

<div class="cell contentCell bbActiveRow" tabindex="-1" style="width: 150px; left: 77px; display: block;" cellposition="15,2"><div class="cell contentCell bbActiveRow last-child" tabindex="-1" style="width: 150px; left: 697px; display: block;" cellposition="15,6">159</div></div><div class="contentRow bb_row" rowindex="16" style="display: block; top: 429px;"><div class="cell first-child " title="Go to box" tabindex="-1" role="linkAction" cellposition="16,0"><span class="pre-child" style="background-color:#16A765;">&nbsp;</span><span class="link" role="link">&nbsp;</span></div>

The important bit I want to catch is the 159 in:

,6">159</div>

I can catch it fine with grep:

cat c |grep  ',6\">[0-9]\+<'

What I actually want is to capture the number itself (159) and print it. Note that the actual file has several such lines. Ideally, only the numbers would be printed.

I thought I could do it with awk:

cat c | awk ' /,6\">([0-9]\+)/ { print $1 } '

But nope, nothing gets printed out. Having the regexp ready, and knowing that there are several lines in the file with entries that match the expression (with different numbers), how would you squeeze those numbers out?

Upvotes: 1

Views: 92

Answers (2)

mklement0

Reputation: 437082

A pragmatic approach:

cat c | grep -o ',6\">[0-9]\+<' | awk -F'<|>' '{ print $2 }'
  • -o causes grep to only report the matching part of each line.
  • awk -F'<|>' '{ print $2 }' then extracts the token between > and <.
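
A minimal sketch of that pipeline on made-up input (the sample lines and the temporary file are illustrative assumptions, not the question's actual file). The second sample line carries two matching cells to show that grep -o emits each match on its own line. The backslash before the double quote is dropped here, since it is not needed inside single quotes:

```shell
# Hypothetical sample input standing in for the real file "c".
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div cellposition="15,2"><div cellposition="15,6">159</div></div>
<div cellposition="16,6">10</div><div cellposition="17,6">20</div>
EOF

# grep -o prints each match (e.g. ,6">159<) on its own line;
# awk then splits on < or > and prints the 2nd field, the number.
numbers=$(grep -o ',6">[0-9]\+<' "$sample" | awk -F'<|>' '{ print $2 }')
printf '%s\n' "$numbers"
rm -f "$sample"
```

Note that \+ in a basic regular expression is an extension (supported by GNU and BSD grep); grep -E with a plain + should behave the same way.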

As for why your awk command didn't work:

  • awk uses extended regular expressions, in which + must NOT be escaped as \+ to be recognized as a quantifier.
  • Even with that fixed, the command wouldn't work as intended: by default, awk splits each line into fields on whitespace, so $1 simply reports the 1st whitespace-separated token on each matching line, irrespective of the regular expression that caused the match.
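
To illustrate the second point with a small sketch (the sample line is a shortened, made-up version of the one in the question): once the + is unescaped the pattern matches, but printing a field variable still yields a whitespace-delimited token rather than the number.

```shell
# Hypothetical, shortened sample line.
line='<div class="cell" cellposition="15,6">159</div>'

# The pattern matches, but $1 is only the first whitespace-separated
# field of the line, not the digits the regex matched.
first_field=$(printf '%s\n' "$line" | awk '/,6">[0-9]+/ { print $1 }')
printf '%s\n' "$first_field"
```

In awk, parentheses in a pattern only group; they do not populate $1, $2, and so on.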

The solution at the top even finds multiple matches on a line, but if we assume that there's at most 1, it is relatively straightforward to do it all in awk, if you have GNU awk:

cat c | gawk '{ m=gensub(/^.*,6\">([0-9]+)<.*$/, "\\1", "1"); if (m != $0) print m }'    
  • The non-POSIX gensub() replaces regex matches and returns the replacement, while crucially also supporting backreferences, which the POSIX sub() and gsub() functions do not.
  • The above matches the entire line, then replaces it with the captured number only (via (escaped) backreference \1), and stores the result in a variable. If the variable doesn't equal the input line, a match was captured, and it is printed.

While a solution with POSIX awk features only is possible (using match(), RSTART, RLENGTH, split()), it would be cumbersome.
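
For the simpler case of at most one match per line, a POSIX-only sketch can get by with match() and substr() alone. This is an illustration under that assumption, not a full equivalent of the grep -o pipeline; the function name extract_numbers and the sample input are hypothetical:

```shell
# Hypothetical sample input; the real file in the question is named "c".
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div cellposition="15,2"><div cellposition="15,6">159</div></div>
<div cellposition="16,0"><span>&nbsp;</span></div>
<div cellposition="17,6">42</div>
EOF

# match() sets RSTART/RLENGTH to the span of the first match on the line.
# The match begins with the 4 characters ,6"> so skip those and keep
# the remaining RLENGTH - 4 characters (the digits).
extract_numbers() {
  awk 'match($0, /,6">[0-9]+/) { print substr($0, RSTART + 4, RLENGTH - 4) }' "$1"
}

result=$(extract_numbers "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Handling several matches on one line would require looping and re-running match() on the remainder of the line, which is where the POSIX approach starts to get cumbersome.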


Finally, if you have xmllint (OS X does, and some Linux distros), consider guido's answer for a solution that performs actual HTML parsing and applies an XPath query, and is therefore more robust.

Upvotes: 2

guido

Reputation: 19194

This one-liner is an alternative way to do that (using an XPath expression that matches div elements whose cellposition attribute value ends with ',6'):

$ xmllint --html test.html --xpath '//div[substring(@cellposition, string-length(@cellposition) - 1)=",6"]/text()'
159

Upvotes: 3
