Reputation: 1472
I have an HTML file and would like to extract the text between <li> and </li> tags. There are of course a million ways to do this, but I figured it would be useful to get more into the habit of doing this with simple shell commands:
awk '/<li[^>]+><a[^>]+>([^>]+)<\/a>/m' cities.html
The problem is that this prints everything, whereas I simply want to print the match in parentheses, ([^>]+). Either awk doesn't support this, or I'm incompetent. The latter seems more likely. If you wanted to apply the supplied regex to a file and extract only the specified matches, how would you do it? I already know a half dozen other ways, but I don't feel like letting awk win this round ;)
Edit: The data is not well-structured, so using positional matches ($1, $2, etc.) is a no-go.
Upvotes: 0
Views: 4787
Reputation: 10956
gawk -F'<li>' -v RS='</li>' 'RT{print $NF}' file
Worked pretty well for me.
Upvotes: 1
Reputation: 321
Going by your script, you can get what you want as long as the <li> and <a> tags are on the same line:
$ cat test.html | awk 'sub(/<li[^>]*><a[^>]*>/,"")&&sub(/<\/a>.*/,"")'
or
$ cat test.html | gawk '/<li[^>]*><a[^>]*>(.*?)<\/a>.*/&&$0=gensub(/<li[^>]*><a[^>]*>(.*?)<\/a>.*/,"\\1", 1)'
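The first one works in any awk; the second one needs GNU awk, because gensub() is a gawk extension. Note that awk regular expressions have no non-greedy quantifier, so the .*? in the second one behaves like a greedy .* and matches up to the last </a> on the line.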
Upvotes: 0
Reputation: 3144
Don't really know awk, how about Perl instead?
tr -d '\012' < the.html | perl \
-e '$text = <>;' -e 'while ( length( $text) > 0)' \
-e '{ $text =~ /<li>(.*?)<\/li>(.*)/ or last; $target = $1; $text = $2; print "$target\n" }'
1) remove newlines from file, pipe through perl
2) initialize a variable with the complete text, start a loop until text is gone
3) do a "non greedy" match for stuff bounded by list-item tags, save and print the target, set up for next pass
Make sense? (warning, did not try this code myself, need to go home soon...)
P.S. - "perl -n" is Awk (nawk?) mode. Perl is largely a superset of Awk, so I never bothered to learn Awk.
Upvotes: 0
Reputation: 753805
There are several issues that I see:
The pattern matches a start list item followed by an anchor, from '<a>' to '</a>', not the end list item.
It searches for anything that is not '>' as the body of the anchor; that's not automatically wrong, but it might be more usual to search for anything that is not '<', or anything that is neither.
In awk, '$1' denotes the first field, where the fields are separated by the field separator characters, which default to white space.
Classic nawk (as documented in the 'sed & awk' book, vintage 1991) does not have a mechanism in place for pulling sub-fields out of matches, etc.
It is not clear that Awk is the right tool for this job. Indeed, it is not entirely clear that regular expressions are the right tool for this job.
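To illustrate the points about '$1' and sub-fields, here is a small sketch using only POSIX awk features. The sample line (class, href and city name) is made up for illustration and is not taken from the asker's file.

```shell
# Hypothetical sample line; the class, href and city name are made up.
line='<li class="c"><a href="/paris">Paris</a></li>'

# $1 is the first whitespace-separated field, not the first capture group.
first_field() { printf '%s\n' "$1" | awk '{ print $1 }'; }

# match() only sets RSTART and RLENGTH for the whole match;
# it does not expose the parenthesized sub-expressions.
whole_match() {
  printf '%s\n' "$1" |
    awk 'match($0, /<a[^>]*>[^<]*<\/a>/) { print substr($0, RSTART, RLENGTH) }'
}

first_field "$line"   # prints: <li
whole_match "$line"   # prints: <a href="/paris">Paris</a>
```

So with a plain awk you get the whole matched span and have to trim the surrounding tags yourself, for example with sub().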
Upvotes: 0
Reputation: 54421
If you want to do this in the general case, where your list tags can contain any legal HTML markup, then awk is the wrong tool. The right tool for the job would be an HTML parser, which you can trust to get correct all of the little details of HTML parsing, including variants of HTML and malformed HTML.
If you are doing this for a special case, where you can control the HTML formatting, then you may be able to make awk work for you. For example, let's assume you can guarantee that each list element never occupies more than one line, is always terminated with </li> on the same line, and never contains any markup (such as a list that contains a list). Then you can use awk to do this, but you need to write a whole awk program that first finds lines that contain list elements, then uses other awk commands to find just the substring you are interested in.
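A minimal sketch of such a program, under exactly those assumptions (each item on one line, anchor text without nested tags), might look like the following. The function name extract_li_text is hypothetical, and the pattern is adapted from the one in the question, not a general HTML matcher.

```shell
# extract_li_text [FILE...]: print the anchor text of each <li><a ...>text</a> item.
# Assumes every item starts and ends on one line and the text has no nested tags.
extract_li_text() {
  awk '
    {
      line = $0
      # Step 1: find the next list item that opens with an anchor.
      while (match(line, /<li[^>]*><a[^>]*>[^<]*<\/a>/)) {
        item = substr(line, RSTART, RLENGTH)
        line = substr(line, RSTART + RLENGTH)   # keep the rest for the next pass
        # Step 2: strip the tags around the text we want.
        sub(/^<li[^>]*><a[^>]*>/, "", item)
        sub(/<\/a>$/, "", item)
        print item
      }
    }
  ' "$@"
}
```

Called as extract_li_text cities.html, or with input on stdin, it prints one anchor text per line. It uses only match(), substr() and sub(), so it should not need gawk.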
But in general, awk is the wrong tool for this job.
Upvotes: 2