David W.

Reputation: 107040

Removing all HTML tags from a webpage

I am doing some BASH shell scripting with curl. If my curl command returns any text, I know I have an error. This text returned by curl is usually in HTML. I figured that if I can strip out all of the HTML tags, I could display the resulting text as an error message.

I was thinking of something like this:

sed -E 's/<.*?>//g' <<<$output_text

But I get sed: 1: "s/<.*?>//": RE error: repetition-operator operand invalid

If I replace *? with *, I don't get the error (and I don't get any text either). If I remove the global (g) flag, I get the same error.

This is on Mac OS X.

Upvotes: 6

Views: 12887

Answers (4)

محسن عباسی

Reputation: 2434

If you want to remove all HTML tags and also all script tags (and their contents), you can use the following:

sed 's/<script>.*<\/script>//g;/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d}};s/<script>.*//g;s/.*<\/script>//g' $file -i && sed '/</ {:k s/<[^>]*>//g; /</ {N; bk}}' $file -i && sed -r '/^\s*$/d' $file -i

Upvotes: 0

clt60

Reputation: 63912

Maybe a parser-based Perl solution?

perl -0777 -MHTML::Strip -nlE 'say HTML::Strip->new->parse($_)' file.html

You must first install the HTML::Strip module with the command cpan HTML::Strip.

Alternatively, you can use a standard OS X utility called textutil (see its man page):

textutil -convert txt file.html

This produces file.txt with the HTML tags stripped. To use it in a pipeline instead:

textutil -convert txt -stdin -stdout < file.html | some_command

Another alternative

Some systems come with the lynx text-only browser installed. You can use:

lynx -dump file.html #or
lynx -stdin -dump < file.html

But in your case, you may only be able to rely on pure sed or awk solutions, IMHO.

But if you have perl (and are only missing the HTML::Strip module), the following is still better than sed:

perl -0777 -pe 's/<.*?>//sg'

because it will also remove a tag like the following, which spans several lines and is common:

<a
 href="#"
 class="some"
>link text</a>
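
As a rough illustration of the difference, the sketch below runs both approaches on a made-up snippet (the html variable and its contents are hypothetical, not from the question). It assumes perl is on the PATH.

```shell
# Hypothetical sample: an <a> tag split across several lines.
html='<p>See <a
 href="#"
 class="some"
>link text</a> here</p>'

# sed works line by line, so the pieces of the split tag are left behind.
line_based=$(printf '%s\n' "$html" | sed 's/<[^>]*>//g')

# perl -0777 slurps the whole input; the /s flag lets . match newlines,
# and .*? is non-greedy, so each tag is removed even when it wraps.
slurped=$(printf '%s\n' "$html" | perl -0777 -pe 's/<.*?>//sg')

printf 'sed:\n%s\n\n' "$line_based"
printf 'perl:\n%s\n' "$slurped"
```

With this input, the perl version prints "See link text here", while the sed version still contains the "<a" fragment and the attribute lines.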

Upvotes: 5

captcha

Reputation: 3756

Code for GNU sed:

sed '/</ {:k s/<[^>]*>//g; /</ {N; bk}}' file

This might fail; you would be better off using a dedicated tool.
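
One case where this kind of pattern can go wrong, sketched with a made-up input: a literal > inside an attribute value. The example uses only the core substitution from the command above, which is all that applies to single-line input; the tag variable is hypothetical.

```shell
# Hypothetical input: the attribute value itself contains a '>'.
tag='<a title="a > b">link</a>'

# [^>]* stops at the first '>', which here is inside the quoted attribute,
# so part of the attribute leaks into the output.
result=$(printf '%s\n' "$tag" | sed 's/<[^>]*>//g')

printf '[%s]\n' "$result"
```

A real HTML parser would return just "link" here, which is why a dedicated tool is the safer choice for arbitrary pages.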

Upvotes: 1

Kent

Reputation: 195059

sed doesn't support non-greedy matching.

Try:

's/<[^>]*>//g'
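
A minimal sketch of why the greedy version prints nothing while the bracket expression works, using a made-up one-line error page (the sample content of output_text is an assumption; in the question it would be whatever curl returned):

```shell
# Hypothetical sample input standing in for curl's output.
output_text='<html><body><h1>404 Not Found</h1><p>No such page</p></body></html>'

# Greedy: <.*> runs from the first '<' to the last '>' on the line,
# so the entire line is deleted.
greedy=$(printf '%s\n' "$output_text" | sed 's/<.*>//g')

# Negated bracket expression: each match ends at the first '>',
# so only the tags themselves are removed.
stripped=$(printf '%s\n' "$output_text" | sed 's/<[^>]*>//g')

printf 'greedy:   [%s]\n' "$greedy"
printf 'stripped: [%s]\n' "$stripped"
```

Both commands use only basic regular expression syntax, so they should behave the same with the BSD sed on OS X and with GNU sed.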

Upvotes: 8
