Reputation: 11
How can I remove links from raw HTML text? I've got:
Foo bar <a href="http://www.foo.com">blah</a> bar foo
and want to get:
Foo bar blah bar foo
afterwards.
Upvotes: 1
Views: 911
Reputation: 342303
$ echo 'Foo bar <a href="http://www.foo.com">blah</a> bar foo' | awk 'BEGIN{RS="</a>"; ORS=""}/<a href/{gsub(/<a href=\042.*\042>/,"")}1'
Foo bar blah bar foo
Upvotes: 0
Reputation: 14291
sed -re 's|<a [^>]*>([^<]*)</a>|\1|g'
But Brian's answer is right: This should only be used in very simple cases.
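To make "very simple cases" concrete, here is a rough sketch that applies the same pattern through Python's `re` module. The function name `strip_simple_links` is made up for illustration. It handles the sample from the question, but leaves a link untouched as soon as the link text contains another tag:

```python
import re

# Same pattern as the sed command above, written for Python's re module.
LINK = re.compile(r"<a [^>]*>([^<]*)</a>")


def strip_simple_links(html):
    """Replace each simple <a ...>text</a> with just its text."""
    return LINK.sub(r"\1", html)


# The sample from the question works:
print(strip_simple_links('Foo bar <a href="http://www.foo.com">blah</a> bar foo'))
# Foo bar blah bar foo

# A link whose text contains another tag is not matched, so it stays as it was:
print(strip_simple_links('<a href="/x"><b>blah</b></a>'))
# <a href="/x"><b>blah</b></a>
```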
Upvotes: 2
Reputation: 272237
You're looking to parse HTML with regexps, and that only works in the simplest cases, since HTML isn't a regular language. A much more reliable solution is to use an HTML parser. Many exist, for many different languages.
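As one possible illustration, here is a minimal sketch using `html.parser` from Python's standard library. The class name `LinkStripper` and the helper `strip_links` are hypothetical names chosen for this example. The sketch assumes that every other tag should be passed through as written, and it drops comments and doctype declarations for brevity:

```python
from html.parser import HTMLParser


class LinkStripper(HTMLParser):
    """Re-emit the input, leaving out <a> start and end tags but keeping their text."""

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; are passed through as-is.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, unchanged.
        if tag != "a":
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "a":
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_links(html):
    parser = LinkStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_links('Foo bar <a href="http://www.foo.com">blah</a> bar foo'))
# Foo bar blah bar foo
```

Unlike the regex approaches, this also copes with markup inside the link text: `strip_links('<a href="/x"><b>blah</b></a>')` gives `<b>blah</b>`.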
Upvotes: 2