Reputation: 41
I wish to extract data between known HTML tags. For example:
Hello, <i>I<i> am <i>very</i> glad to meet you.
Should become:
'I
very'
I have found something that nearly does this. Unfortunately, it only extracts the last entry:
sed -n -e 's/.*<i>\(.*\)<\/i>.*/\1/p'
Now, if I first append a newline character to every end tag </i>, this works fine. But is there a way to do it with just one sed command?
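For reference, the two-pass workaround described above might look like the following sketch. It assumes properly closed tags in the input; the variable `line` and the function name `extract_two_pass` are made up for illustration. The first pass breaks the line after each end tag, so the greedy `.*` in the second pass can no longer swallow earlier matches.

```shell
# Sample input with properly closed tags (an assumption for this sketch).
line='Hello, <i>I</i> am <i>very</i> glad to meet you.'

# extract_two_pass is a hypothetical helper name used only for this sketch.
# Pass 1 puts a line break after every </i>; backslash-newline is the
# portable way to write a newline in a sed replacement.
# Pass 2 is the original expression, which now sees at most one
# <i>...</i> pair per line.
extract_two_pass() {
  sed 's|</i>|</i>\
|g' |
  sed -n 's/.*<i>\(.*\)<\/i>.*/\1/p'
}

printf '%s\n' "$line" | extract_two_pass
```

With the sample line this prints `I` and `very` on separate lines.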
Upvotes: 4
Views: 16699
Reputation: 359955
Give this a try:
sed -n 's|[^<]*<i>\([^<]*\)</i>[^<]*|\1\n|gp'
Also, the first closing tag in your example is missing a "/". It should read:
Hello, <i>I</i> am <i>very</i> glad to meet you.
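Two caveats about the one-liner above: `\n` in the replacement is not guaranteed by POSIX (GNU sed accepts it), and because a newline is added after every match, the output ends with an empty line. A possible variant that addresses both is sketched below. The variable `line` and the function name `extract_italics` are made up for illustration, and the sketch has the same limits as the original expression (one input line at a time, no nested tags).

```shell
# Sample input: the corrected example line.
line='Hello, <i>I</i> am <i>very</i> glad to meet you.'

# extract_italics is a hypothetical helper name used only for this sketch.
# First command: same substitution as above, but the newline in the
# replacement is written as backslash-newline for portability.
# Second command: drop the trailing newline left after the last match and
# print; lines without any <i>...</i> never match it and stay silent.
extract_italics() {
  sed -n 's|[^<]*<i>\([^<]*\)</i>[^<]*|\1\
|g
s/\n$//p'
}

printf '%s\n' "$line" | extract_italics
```

With the sample line this prints `I` and `very`, with no blank line after them.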
Upvotes: 3
Reputation: 342313
$ awk -vFS="<.[^>]*>" '{for(i=2;i<=NF;i+=2)print $i}' file
I
very
Upvotes: 0