michste93

Reputation: 423

Sed remove tags from html file

I need to remove all tags from an HTML file with a bash script using the sed command. I tried this:

sed -r 's/[\<][\/]?[a-zA-Z0-9\=\"\-\#\.\& ]+[\/]?[\>]//g' $1

and with this:

sed -r 's/[\<][\/]?[.]*[\/]?[\\]?[\>]//g' $1

but I am still missing something. Any suggestions?

Upvotes: 40

Views: 83701

Answers (4)

Sridhar Sarnobat

Reputation: 25236

I know the OP asked for sed specifically, but this page shows up as a top result in Google even for people who are not searching for sed.

Perl one-liner

cat - | perl -pe 's{\n}{ }g' | perl -pe 's{>}{>\n}g' | perl -pe 's{<}{\n<}g' | grep -v '<' | grep -v '^\s*$'

Feel free to edit this (I've marked it as community wiki), it's not perfect.

Explanation

Too much for me to type for now, but explainshell.com is a start.

Other notes

I'm surprised there isn't a mature tool out there that does this, just lots of messy npm command-line tools. I'm not a fan of the amount of junk npm leaves behind. A precompiled single Go binary, or something installable via brew install, would be the ultimate dream.

Upvotes: 1

Olaf Dietsche

Reputation: 74028

You can use one of the many HTML-to-text converters, a Perl regex such as <.+?> if Perl is available, or, if it must be sed, <[^>]*>:

sed -e 's/<[^>]*>//g' file.html

If there's no room for error, use an HTML parser instead. For example, when an element is spread over two lines

<div
>Lorem ipsum</div>

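this regular expression will not work, because sed processes its input one line at a time and the opening tag never appears complete on a single line.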
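If an HTML parser is not an option, one workaround is to have sed read the whole file into its pattern space before substituting, so that a tag split across lines is seen in one piece. A minimal sketch, assuming the file fits comfortably in memory; the function name strip_tags_multiline is hypothetical:

```shell
strip_tags_multiline() {
    # :a        define label a
    # $!{N;ba}  unless this is the last line, append the next line and loop
    # s/.../    once everything is loaded, delete every <...> span;
    #           [^>] also matches the embedded newlines
    sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' -e 's/<[^>]*>//g'
}

printf '<div\n>Lorem ipsum</div>\n' | strip_tags_multiline   # prints: Lorem ipsum
```

This still treats a > inside an attribute value or a comment as the end of a tag, so it narrows the gap to a parser without closing it.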


This regular expression consists of three parts: <, [^>]*, and >

  • search for the opening <
  • followed by zero or more characters (*) that are not the closing >
    [...] is a character class; when it starts with ^, it matches any character not in the class
  • and finally look for the closing >

The simpler regular expression <.*> will not work, because it is greedy: it finds the longest possible match, which extends to the last closing > in an input line. For example, when an input line contains more than one tag,

<name>Olaf</name> answers questions.

will result in

answers questions.

instead of

Olaf answers questions.

See also Repetition with Star and Plus, especially section Watch Out for The Greediness! and following, for a detailed explanation.
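The difference between the two patterns can be checked directly on the sample line above. A small sketch; the variable names are only for illustration:

```shell
line='<name>Olaf</name> answers questions.'

# Greedy: .* runs to the last '>' on the line, so "Olaf" is deleted too.
greedy=$(printf '%s\n' "$line" | sed -e 's/<.*>//g')

# Negated class: each match stops at the first '>' after its '<'.
careful=$(printf '%s\n' "$line" | sed -e 's/<[^>]*>//g')

printf '[%s]\n' "$greedy"    # prints: [ answers questions.]
printf '[%s]\n' "$careful"   # prints: [Olaf answers questions.]
```

The brackets in the output only make the leading space left behind by the greedy version visible.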

Upvotes: 109

mgutt

Reputation: 6177

Far from perfect, but it was sufficient for me:

curl -Ls https://stackoverflow.com | # load the HTML content
tr -d '\r' |                         # remove carriage returns
tr '\n' '\r' |                       # replace line breaks with carriage returns so sed can match across multiple lines
sed -E "s/\/(script|style)>/\n/g" |  # replace closing script/style tags with a newline
sed -E "s/<(script|style).*//g" |    # remove each script/style block up to the newline inserted above
sed -E 's/(="[^"]*)>/\1/g' |         # remove a closing bracket inside a double-quoted attribute value
sed -E "s/(='[^']*)>/\1/g" |         # remove a closing bracket inside a single-quoted attribute value
sed "s/<[^>]*>/ /g" |                # replace all remaining HTML tags with a space
tr '\r' '\n' |                       # replace carriage returns with newlines
tr '\t' ' ' |                        # replace tabs with spaces
tr -s ' ' |                          # squeeze consecutive spaces
sed "s/^ //g" |                      # remove the leading space from each line
grep -v "^$"                         # remove empty lines

Returns:

Stack Overflow - Where Developers Learn, Share, &amp; Build Careers 
Stack Overflow 
About 
Products
For Teams 
Stack Overflow 
Public questions &amp; answers 
Stack Overflow for Teams 
Where developers &amp; technologists share private knowledge with coworkers 
Talent 
Build your employer brand
Advertising 
Reach developers &amp; technologists worldwide 
Labs 
The future of collective knowledge sharing 
About the company 
Loading&#x2026; 
current community 
Stack Overflow
help 
chat 
...
API
Data
Blog 
Facebook 
Twitter 
LinkedIn 
Instagram 
Site design / logo &#169; 2024 Stack Exchange Inc; user contributions licensed under CC BY-SA . rev&nbsp;2024.3.22.6753 
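The output above still contains character references such as &amp; and &nbsp; because the pipeline only removes tags. A rough sketch of an extra stage that decodes a few common named entities follows; the function name decode_entities and the choice of entities are assumptions, and a real converter handles many more, including numeric references like &#x2026;.

```shell
decode_entities() {
    # &amp; is decoded last so that "&amp;lt;" becomes "&lt;" and is not decoded twice.
    # In a sed replacement, a bare & means "the matched text", hence \&.
    sed -e 's/&nbsp;/ /g' \
        -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' \
        -e 's/&amp;/\&/g'
}

printf '%s\n' 'Public questions &amp; answers' | decode_entities   # prints: Public questions & answers
```

It would go at the end of the pipeline, after the tags have been removed, so that a decoded < or > is never mistaken for markup.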

Upvotes: 0

Jonesy

Reputation: 1

I've often used lynx -dump -nolist <URL> for the OP's purpose. However, you still get formatting, so you might want to additionally strip leading blanks on each line.
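Stripping the leading blanks mentioned above is a one-line sed filter. A sketch, with strip_leading_blanks as a made-up name:

```shell
strip_leading_blanks() {
    # Delete any run of spaces or tabs at the start of each line.
    sed -e 's/^[[:blank:]]*//'
}

printf '   Heading\n      indented text\n' | strip_leading_blanks
```

It can be appended to the command above, as in lynx -dump -nolist <URL> | strip_leading_blanks.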

Upvotes: 0
