Sundararaman P

Reputation: 361

printing lines based on pattern matching in multiple fields using awk

Suppose I have an HTML input line like:

<li>this is a html input line</li>

I want to filter all such lines from a file, i.e. every line that begins with <li> and ends with </li>. My idea was to search for the pattern <li> in the first field and the pattern </li> in the last field using the awk command below:

awk '$1 ~ /\<li\>/ ; $NF ~ /\</li\>/ {print $0}'

but it looks like either there is no provision to match two fields at a time or I am making a syntax mistake. Could you please help me here?

PS: I am working on a Solaris SunOS machine.

Upvotes: 1

Views: 663

Answers (2)

Eric Renouf

Reputation: 14490

Why not just use a regex to match the start and end of the line, like this:

awk '/^[[:space:]]*<li>.*<\/li>[[:space:]]*$/ {print}'

though in general, if you're trying to process HTML, you'll be better off using a tool that's really designed to handle that.
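As a rough check of what that regex accepts, here is a small sketch run against made-up sample lines (the sample text and the variable names `sample` and `matched` are assumptions for illustration, not from the question). It assumes an awk that supports POSIX character classes such as [[:space:]].

```shell
# Hypothetical sample input: a plain item, an item with leading whitespace,
# and an item followed by extra text on the same line.
sample=$(printf '%s\n' \
  '<li>plain item</li>' \
  '   <li>indented item</li>' \
  '<li>item</li> trailing text')

# Same command as in the answer: optional whitespace is allowed before <li>
# and after </li>, but nothing else may precede or follow them.
matched=$(printf '%s\n' "$sample" |
  awk '/^[[:space:]]*<li>.*<\/li>[[:space:]]*$/ {print}')

printf '%s\n' "$matched"
```

With this sample, the first two lines should be selected and the third rejected, because text follows the closing </li>.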

Upvotes: 1

Ed Morton

Reputation: 203229

There's a lot going wrong in your script on Solaris:

awk '$1 ~ /\<li\>/ ; $NF ~ /\</li\>/ {print $0}'
  1. The default awk on Solaris (and so the one we have to assume you are using, since you didn't state otherwise) is old, broken awk, which must never be used. On Solaris use /usr/xpg4/bin/awk. There's also nawk, but it has fewer POSIX features (e.g. no support for character classes).
  2. \<...\> are gawk-specific word boundaries. There is no awk on Solaris that would recognize those. If you were just trying to get literal characters then there's no need to escape them as they are not regexp metacharacters.
  3. If you want to test for condition 1 and condition 2, you put && between them, not ; which is just the statement terminator used in lieu of a newline (so your script is really two separate statements, not one combined test).
  4. The default action given a true condition is {print $0} so you don't need to explicitly write that code.
  5. / is the awk regexp delimiter so you do need to escape that in mid-regexp.
  6. The default field separator is white space so in your posted sample input $1 and $NF will be <li>this and line</li>, not <li> and </li>.
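Points 3 and 6 are easy to see by running a POSIX awk on the sample line from the question. The following is only a sketch: the variable names (`line`, `fields`, `with_semicolon`, `with_and`) are made up for illustration, and it assumes a POSIX-compliant awk rather than the old default Solaris one.

```shell
# Sample line taken from the question.
line='<li>this is a html input line</li>'

# Point 6: with the default field separator, the tags are glued to the
# neighbouring words, so $1 and $NF are not the bare tags.
fields=$(printf '%s\n' "$line" | awk '{ print $1; print $NF }')
printf '%s\n' "$fields"

# Point 3: ";" separates two independent rules, each with the default print
# action, so a line that satisfies both conditions is printed twice.
with_semicolon=$(printf '%s\n' "$line" | awk '$1 ~ /^<li>/ ; $NF ~ /<\/li>$/')

# "&&" makes it a single rule that requires both conditions.
with_and=$(printf '%s\n' "$line" | awk '$1 ~ /^<li>/ && $NF ~ /<\/li>$/')

printf '%s\n' "$with_semicolon" | grep -c '<li>'   # number of lines printed with ";"
printf '%s\n' "$with_and" | grep -c '<li>'         # number of lines printed with "&&"
```

Here `fields` holds `<li>this` and `line</li>` on separate lines, which is why a test for a field that is exactly the tag would never succeed on this input.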

So if you DID for some reason want to compare multiple fields, you could do:

awk '($1 ~ /^<li>.*/) && ($NF ~ /.*<\/li>$/)'

but this is probably what you really want:

awk '/^<li>.*<\/li>/'

in which case you could just use grep:

grep '^<li>.*</li>'
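To compare the two commands, here is a quick sketch that feeds both the same input. The sample lines and the variable names (`sample`, `awk_out`, `grep_out`) are hypothetical and only there for illustration.

```shell
# Hypothetical sample input: one complete item, one non-item line,
# one item without a closing tag, and one indented item.
sample=$(printf '%s\n' \
  '<li>this is a html input line</li>' \
  '<p>not a list item</p>' \
  '<li>unterminated item' \
  '  <li>indented item</li>')

# The whole-line awk match and the equivalent grep from above.
awk_out=$(printf '%s\n' "$sample" | awk '/^<li>.*<\/li>/')
grep_out=$(printf '%s\n' "$sample" | grep '^<li>.*</li>')

printf 'awk:  %s\n' "$awk_out"
printf 'grep: %s\n' "$grep_out"
```

Both should select only the first line. The indented item is skipped because ^ anchors <li> to the very start of the line, so input with leading whitespace would need that allowed for explicitly in the regexp.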

Upvotes: 3
