Reputation: 1472

Awk/etc.: Extract Matches from File

I have an HTML file and would like to extract the text between <li> and </li> tags. There are of course a million ways to do this, but I figured it would be useful to get more into the habit of doing this in simple shell commands:

awk '/<li[^>]+><a[^>]+>([^>]+)<\/a>/m' cities.html

The problem is, this prints everything, whereas I simply want to print the match in parentheses -- ([^>]+) -- either awk doesn't support this, or I'm incompetent. The latter seems more likely. If you wanted to apply the supplied regex to a file and extract only the specified matches, how would you do it? I already know a half dozen other ways, but I don't feel like letting awk win this round ;)

Edit: The data is not well-structured, so using positional matches ($1, $2, etc.) is a no-go.

Upvotes: 0

Answers (5)

Everett Toews

Reputation: 10956

gawk -F'<li>' -v RS='</li>' 'RT{print $NF}' file

Worked pretty well for me.

Upvotes: 1

Hirofumi Saito

Reputation: 321

If the pattern in your script already finds the lines you want (which means the <li> and <a> tags are on the same line), either of these should work:

$ cat test.html | awk 'sub(/<li[^>]*><a[^>]*>/,"")&&sub(/<\/a>.*/,"")'

$ cat test.html | gawk '/<li[^>]*><a[^>]*>[^<]*<\/a>.*/&&$0=gensub(/<li[^>]*><a[^>]*>([^<]*)<\/a>.*/,"\\1", 1)'

The first one works in any awk; the second one needs GNU awk, because gensub is a gawk extension.

Upvotes: 0

Roboprog

Reputation: 3144

Don't really know awk, how about Perl instead?

tr -d '\012' < the.html | perl \
-e '$text = <>;' -e 'while ( length( $text) > 0)' \
-e '{ last unless $text =~ /<li>(.*?)<\/li>(.*)/; $target = $1; $text = $2; print "$target\n" }'

1) remove newlines from file, pipe through perl

2) initialize a variable with the complete text, start a loop until text is gone

3) do a "non greedy" match for stuff bounded by list-item tags, save and print the target, set up for next pass

Make sense? (warning, did not try this code myself, need to go home soon...)

P.S. - "perl -n" is Awk (nawk?) mode. Perl is largely a superset of Awk, so I never bothered to learn Awk.

Upvotes: 0

Jonathan Leffler

Reputation: 753805

There are several issues that I see:

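The pattern has a trailing 'm', which is significant for multi-line matches in Perl, but Awk does not use Perl-compatible regular expressions and has no regex flags. Awk reads /.../m as the match result concatenated with an unset variable m, which gives a non-empty string that is always true, so every line is printed.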
Ignoring that, the pattern seems to search for a 'start list item' followed by an anchor '<a>' to '</a>', not the end list item.
You search for anything that is not a '>' as the body of the anchor; that's not automatically wrong, but it might be more usual to search for anything that is not '<', or anything that is neither.
Awk does not do multi-line searches by default; it matches one record (normally one line) at a time.
In Awk, '$1' denotes the first field, where the fields are separated by the field separator characters, which default to white space.
Classic nawk (as documented in the 'sed & awk' book, vintage 1991) does not have a mechanism for pulling sub-fields out of matches, etc.
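For what it's worth, here is a minimal sketch of how the anchor text could still be pulled out without capture groups, using only match(), substr() and sub(). The function name extract_anchor_text is made up for illustration, and the sketch assumes the <li>, <a> and </a> tags all sit on one line and that the anchor body contains no '<'.

```shell
# Hypothetical helper: print the text of the first <li ...><a ...>TEXT</a> on each line.
# Assumes the opening tags and the closing </a> are all on the same line.
extract_anchor_text() {
  awk '
    # match() sets RSTART and RLENGTH to the position and length of the match.
    match($0, /<li[^>]*><a[^>]*>[^<]*<\/a>/) {
      s = substr($0, RSTART, RLENGTH)   # keep only the matched span
      sub(/^<li[^>]*><a[^>]*>/, "", s)  # strip the opening tags
      sub(/<\/a>$/, "", s)              # strip the closing tag
      print s
    }
  ' "$@"
}

# Example input is invented for this sketch.
printf '%s\n' '<ul>' '<li class="c"><a href="/paris">Paris</a></li>' '</ul>' | extract_anchor_text
```

Lines that do not match the pattern are skipped, so only the anchor text is printed. Only the first match on each line is handled.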

It is not clear that Awk is the right tool for this job. Indeed, it is not entirely clear that regular expressions are the right tool for this job.

Upvotes: 0

Eddie

Reputation: 54421

If you want to do this in the general case, where your list tags can contain any legal HTML markup, then awk is the wrong tool. The right tool for the job would be an HTML parser, which you can trust to get correct all of the little details of HTML parsing, including variants of HTML and malformed HTML.

If you are doing this for a special case, where you can control the HTML formatting, then you may be able to make awk work for you. For example, suppose you can guarantee that each list element never occupies more than one line, is always terminated with </li> on the same line, and never contains any markup (such as a nested list). Then you can use awk to do this, but you need to write a whole awk program that first finds the lines that contain list elements and then uses other awk commands to pull out just the substring you are interested in.
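As a rough sketch of what such a program might look like under exactly those guarantees (every item opens and closes on one line, with no markup inside it), something like the following could work. The function name list_items is hypothetical, and this is not a substitute for a real HTML parser.

```shell
# Hypothetical helper: print the plain-text body of every <li>...</li> on each line.
# Assumes each item starts and ends on the same line and contains no nested tags.
list_items() {
  awk '
    {
      line = $0
      # Repeatedly find the next item; [^<]* enforces the no-markup assumption.
      while (match(line, /<li[^>]*>[^<]*<\/li>/)) {
        item = substr(line, RSTART, RLENGTH)
        sub(/^<li[^>]*>/, "", item)            # drop the opening tag
        sub(/<\/li>$/, "", item)               # drop the closing tag
        print item
        line = substr(line, RSTART + RLENGTH)  # continue after this item
      }
    }
  ' "$@"
}

# Example input is invented for this sketch.
printf '%s\n' '<ul><li>Oslo</li><li id="b">Bergen</li></ul>' | list_items
```

Any item that breaks the assumptions, for example one that spans two lines or contains an <a> tag, is silently skipped.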

But in general, awk is the wrong tool for this job.

Upvotes: 2
