Reputation: 423
I need to remove all tags from an HTML file with a bash script using the sed command. I tried this:
sed -r 's/[\<][\/]?[a-zA-Z0-9\=\"\-\#\.\& ]+[\/]?[\>]//g' $1
and this:
sed -r 's/[\<][\/]?[.]*[\/]?[\\]?[\>]//g' $1
but I'm still missing something. Any suggestions?
Upvotes: 40
Views: 83701
Reputation: 25236
I know the OP asked for sed
specifically, but this page shows up as a top result in Google even for non-sed searchers, so here is a Perl and grep pipeline:
cat - | perl -pe 's{\n}{ }g' | perl -pe 's{>}{>\n}g' | perl -pe 's{<}{\n<}g' | grep -v '<' | grep -v '^\s*$'
It's not perfect, so feel free to edit this (I've marked it as community wiki).
A full explanation is too much for me to type right now, but explainshell.com is a start.
I'm surprised there isn't a mature tool out there that does this, just lots of messy npm command line tools. I'm not a fan of the amount of junk npm leaves behind. A precompiled single Go binary, or something installable via brew install,
would be the ultimate dream.
Upvotes: 1
Reputation: 74028
You can use one of the many HTML-to-text converters, or, if Perl regexes are available, a non-greedy match: <.+?>
or, if it must be sed,
use <[^>]*>
sed -e 's/<[^>]*>//g' file.html
If there's no room for errors, use an HTML parser instead. E.g. when an element is spread over two lines
<div
>Lorem ipsum</div>
this regular expression will not work, because sed applies it to one line at a time and neither line contains a complete tag.
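If a parser is not an option, one possible workaround is to let sed collect the whole input into its pattern space before substituting, so that a tag broken across lines is seen in one piece. The sketch below assumes the input is small enough to hold in memory; the function name strip_tags_multiline is hypothetical. It still fails on comments, scripts, and > characters inside attribute values.

```shell
# Hypothetical helper: strip tags even when a tag spans several lines.
strip_tags_multiline() {
  # :a          define a label
  # $!{ N; ba } while not on the last line, append the next line and loop
  # s/...//g    then strip tags from the joined text ([^>] also matches newlines)
  sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' -e 's/<[^>]*>//g' "$@"
}

printf '<div\n>Lorem ipsum</div>\n' | strip_tags_multiline
```

The function reads standard input when called without arguments, or the named files otherwise, for example strip_tags_multiline file.html.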
This regular expression consists of three parts: <, [^>]*, and >
< matches an opening <
[^>]* matches zero or more characters (the *) which are not the closing >
[...] is a character class; when it starts with ^, it matches any character not in the class
> matches the closing >
The simpler regular expression <.*>
will not work, because it searches for the longest possible match, i.e. the last closing >
in an input line. E.g., when you have more than one tag in an input line
<name>Olaf</name> answers questions.
will result in
answers questions.
instead of
Olaf answers questions.
See also Repetition with Star and Plus, especially section Watch Out for The Greediness! and following, for a detailed explanation.
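The difference between the two patterns is easy to check on the sample line. A minimal sketch; the variable names are illustrative only:

```shell
line='<name>Olaf</name> answers questions.'

# Greedy: <.*> runs from the first < to the last > on the line.
greedy=$(printf '%s\n' "$line" | sed -e 's/<.*>//g')

# Negated class: each match stops at the first > it meets.
per_tag=$(printf '%s\n' "$line" | sed -e 's/<[^>]*>//g')

printf 'greedy:  [%s]\n' "$greedy"
printf 'per tag: [%s]\n' "$per_tag"
```

The brackets in the output make the leading space left behind by the greedy match visible.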
Upvotes: 109
Reputation: 6177
Far from perfect, but it was sufficient for me:
curl -Ls https://stackoverflow.com | # load the HTML content
tr -d '\r' | # remove carriage returns
tr '\n' '\r' | # replace line breaks with carriage returns so sed can match across multiple lines
sed -E "s/\/(script|style)>/\n/g" | # replace the end of each closing script/style tag with a newline
sed -E "s/<(script|style).*//g" | # remove each script/style block up to that newline (emulates a non-greedy match)
sed -E 's/(="[^"]*)>/\1/g' | # remove a closing bracket inside double quotes
sed -E "s/(='[^']*)>/\1/g" | # remove a closing bracket inside single quotes
sed "s/<[^>]*>/ /g" | # replace all other HTML tags with a space
tr '\r' '\n' | # turn the carriage returns back into newlines
tr '\t' ' ' | # replace tabs with spaces
tr -s ' ' | # squeeze consecutive spaces into one
sed "s/^ //g" | # remove the space at the beginning of each line
grep -v "^$" # remove empty lines
Returns:
Stack Overflow - Where Developers Learn, Share, & Build Careers
Stack Overflow
About
Products
For Teams
Stack Overflow
Public questions & answers
Stack Overflow for Teams
Where developers & technologists share private knowledge with coworkers
Talent
Build your employer brand
Advertising
Reach developers & technologists worldwide
Labs
The future of collective knowledge sharing
About the company
Loading…
current community
Stack Overflow
help
chat
...
API
Data
Blog
Facebook
Twitter
LinkedIn
Instagram
Site design / logo © 2024 Stack Exchange Inc; user contributions licensed under CC BY-SA . rev 2024.3.22.6753
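The two quote-handling steps in the pipeline matter when an attribute value itself contains a >. A small sketch on a made-up input (the variable names and the sample tag are illustrative only), using only the double-quote step:

```shell
html='<a title="a>b">link</a>'

# Without the pre-pass, the > inside the attribute value ends the tag early.
naive=$(printf '%s\n' "$html" | sed -e 's/<[^>]*>//g')

# With the pre-pass from the pipeline, the stray > is removed first.
fixed=$(printf '%s\n' "$html" | sed -E 's/(="[^"]*)>/\1/g' | sed -e 's/<[^>]*>//g')

printf 'naive: [%s]\n' "$naive"
printf 'fixed: [%s]\n' "$fixed"
```

The naive run leaves part of the attribute behind, while the pre-pass version leaves only the link text.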
Upvotes: 0
Reputation: 1
I've often used lynx -dump -nolist <URL>
for the OP's purpose. However, you still get formatting, so you might want to additionally strip leading blanks on each line.
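Stripping the leading blanks could look like the sketch below. The helper name strip_leading_blanks is hypothetical, and example.com in the usage comment is a placeholder URL.

```shell
# Hypothetical helper: drop leading spaces and tabs from every line.
strip_leading_blanks() {
  sed -e 's/^[[:space:]]*//'
}

# Usage sketch (needs lynx installed and network access):
# lynx -dump -nolist https://example.com | strip_leading_blanks

# Local check that needs no network:
printf '   Hello\n\tWorld\n' | strip_leading_blanks
```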
Upvotes: 0