Reputation: 3948
I'm trying to replace all HTML tags such as <div> </div> with an empty string ("") in Go, using the regex pattern ^[^.\/]*$/g
to match all closing tags, e.g. </div>.
My solution:
package main
import (
    "fmt"
    "regexp"
)

const Template = `^[^.\/]*$/g`

func main() {
    r := regexp.MustCompile(Template)
    s := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"
    res := r.ReplaceAllString(s, "")
    fmt.Println(res)
}
But it outputs the same source string. What's wrong? Please help. Thanks.
Expected result: "afsdf4534534!@@!!#345345afsdf4534534!@@!!#"
Upvotes: 8
Views: 19879
Reputation: 2586
This is a very simple RegEx replace method that removes HTML tags from well-formatted HTML in a string.
strip_html_regex.go
package main

import "regexp"

const regex = `<.*?>`

// This method uses a regular expression to remove HTML tags.
func stripHtmlRegex(s string) string {
    r := regexp.MustCompile(regex)
    return r.ReplaceAllString(s, "")
}
Note: this does not work well with malformed HTML. Don't use this.
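For comparison, the question's pattern can never match anything: Go's RE2 regexp syntax has no /g modifier, so the trailing /g is treated as literal characters that would have to appear after the end-of-text anchor $, which is impossible. Applying the non-greedy pattern above to the question's input gives the expected result (a minimal sketch):

package main

import (
    "fmt"
    "regexp"
)

func main() {
    r := regexp.MustCompile(`<.*?>`)
    s := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"
    fmt.Println(r.ReplaceAllString(s, ""))
    // afsdf4534534!@@!!#345345afsdf4534534!@@!!#
}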
Since a string in Go can be treated as a slice of bytes, it is easy to walk through the string and find the portions that are not inside an HTML tag. When we identify a valid portion of the string, we can simply take a slice of that portion and append it using a strings.Builder.
strip_html.go
package main

import (
    "strings"
    "unicode/utf8"
)

const (
    htmlTagStart = 60 // Unicode `<`
    htmlTagEnd   = 62 // Unicode `>`
)
// Aggressively strips HTML tags from a string.
// It will only keep anything between `>` and `<`.
func stripHtmlTags(s string) string {
    // Set up a string builder and allocate enough memory for the new string.
    var builder strings.Builder
    builder.Grow(len(s) + utf8.UTFMax)

    in := false // True if we are inside an HTML tag.
    start := 0  // The index of the previous start tag character `<`
    end := 0    // The index of the previous end tag character `>`

    for i, c := range s {
        // If this is the last character and we are not in an HTML tag, save it.
        if (i+1) == len(s) && end >= start {
            builder.WriteString(s[end:])
        }

        // Keep going if the character is not `<` or `>`
        if c != htmlTagStart && c != htmlTagEnd {
            continue
        }

        if c == htmlTagStart {
            // Only update the start if we are not in a tag.
            // This makes sure we strip out `<<br>` not just `<br>`
            if !in {
                start = i

                // Write the valid string between the close and start of the two tags.
                builder.WriteString(s[end:start])
            }
            in = true
            continue
        }
        // else c == htmlTagEnd
        in = false
        end = i + 1
    }
    s = builder.String()
    return s
}
If we run these two functions with the OP's text and some malformed HTML, you will see that the results are not consistent.
main.go
package main
import "fmt"
func main() {
    s := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"
    res := stripHtmlTags(s)
    fmt.Println(res)

    // Malformed HTML examples
    fmt.Println("\n:: stripHTMLTags ::\n")
    fmt.Println(stripHtmlTags("Do something <strong>bold</strong>."))
    fmt.Println(stripHtmlTags("h1>I broke this</h1>"))
    fmt.Println(stripHtmlTags("This is <a href='#'>>broken link</a>."))
    fmt.Println(stripHtmlTags("I don't know ><where to <<em>start</em> this tag<."))

    // Regex Malformed HTML examples
    fmt.Println(":: stripHtmlRegex ::\n")
    fmt.Println(stripHtmlRegex("Do something <strong>bold</strong>."))
    fmt.Println(stripHtmlRegex("h1>I broke this</h1>"))
    fmt.Println(stripHtmlRegex("This is <a href='#'>>broken link</a>."))
    fmt.Println(stripHtmlRegex("I don't know ><where to <<em>start</em> this tag<."))
}
Output:
afsdf4534534!@@!!#345345afsdf4534534!@@!!#
:: stripHTMLTags ::
Do something bold.
I broke this
This is broken link.
start this tag
:: stripHtmlRegex ::
Do something bold.
h1>I broke this
This is >broken link.
I don't know >start this tag<.
Note that the RegEx method does not remove all HTML tags consistently. To be honest, I am not good enough at RegEx to write a pattern that properly handles stripping HTML.
Aside from the advantage of being safer and more aggressive in stripping malformed HTML tags, stripHtmlTags is about 4 times faster than stripHtmlRegex.
> go test -run=Calculate -bench=.
goos: windows
goarch: amd64
BenchmarkStripHtmlRegex-8 51516 22726 ns/op
BenchmarkStripHtmlTags-8 230678 5135 ns/op
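The benchmark harness itself is not shown in the answer; a minimal sketch of what benchmarks like these could look like (the file name and input string are assumptions, not from the original answer):

strip_html_test.go

package main

import "testing"

// benchInput is an assumed test input; any reasonably sized HTML string works.
var benchInput = "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"

func BenchmarkStripHtmlRegex(b *testing.B) {
    for i := 0; i < b.N; i++ {
        stripHtmlRegex(benchInput)
    }
}

func BenchmarkStripHtmlTags(b *testing.B) {
    for i := 0; i < b.N; i++ {
        stripHtmlTags(benchInput)
    }
}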
Upvotes: 13
Reputation: 824
I found this code:

import (
    "regexp"
    "sort"
    "strings"
)

func RemoveHtmlTag(in string) string {
    // regex to match html tags
    const pattern = `(<\/?[a-zA-Z]+?[^>]*\/?>)*`
    r := regexp.MustCompile(pattern)
    groups := r.FindAllString(in, -1)
    // should replace longer strings first
    sort.Slice(groups, func(i, j int) bool {
        return len(groups[i]) > len(groups[j])
    })
    for _, group := range groups {
        if strings.TrimSpace(group) != "" {
            in = strings.ReplaceAll(in, group, "")
        }
    }
    in = strings.ReplaceAll(in, ">", "")
    return strings.ReplaceAll(in, "<", "")
}
Code source : https://gist.github.com/g10guang/04f11221dadf1ed019e0d3cf3e82caf3
fmt.Println(fmt.Sprintf("%s", RemoveHtmlTag("<b>>text with <where <>I don't know ><where to <<em>start</em> this tag<. tags<</b>")))
which returns:
text with I don't know start this tag. tags
I ran a benchmark comparing this method with a previous method on this page:

func stripHtmlRegex(input string) string {
    input = regexp.MustCompile(`<.*?>`).ReplaceAllString(input, "")
    input = strings.ReplaceAll(input, ">", "")
    return strings.ReplaceAll(input, "<", "")
    // return strings.NewReplacer(">", "", "<", "").Replace(input) // slow
}
and the bluemonday lib:

func Sanitize(s string) string {
    p := bluemonday.StripTagsPolicy()
    return p.Sanitize(s)
}
Results:
BenchmarkStripHtmlRegex-12    537960     2076 ns/op
BenchmarkRemoveHtmlTag-12     102960    10326 ns/op
BenchmarkSanitize-12          365942     3079 ns/op
Conclusion: stripHtmlRegex is fastest, followed by the bluemonday lib.
Test code: https://go.dev/play/p/7fkN9Vp-86U
Upvotes: 0
Reputation: 3840
We have tried this in production, but under certain corner cases none of the proposed solutions really work. If you need something robust, check out the unexported tag-stripping code in Go's internal library (the html-strip-tags-go pkg is basically an export of that, with a BSD-3 license). Or https://github.com/microcosm-cc/bluemonday is a pretty popular lib (BSD-3 as well) that we ended up using.
=================================================
Improvement on @Daniel Morell's answer. The only difference here is that len on a string counts bytes, and a UTF-8 character takes between 1 and 4 bytes, so len("è") actually evaluates to 2. To fix that, we convert the string to a rune slice.
https://go.dev/play/p/xo7Mrx5qw-_J
// Aggressively strips HTML tags from a string.
// It will only keep anything between `>` and `<`.
func stripHTMLTags(s string) string {
    // Supports UTF-8, since some characters take more than 1 byte, e.g. len("è") == 2.
    d := []rune(s)

    // Set up a string builder and allocate enough memory for the new string.
    var builder strings.Builder
    builder.Grow(len(d) + utf8.UTFMax)

    in := false // True if we are inside an HTML tag.
    start := 0  // The index of the previous start tag character `<`
    end := 0    // The index of the previous end tag character `>`

    for i, c := range d {
        // If this is the last character and we are not in an HTML tag, save it.
        if (i+1) == len(d) && end >= start {
            builder.WriteString(string(d[end:]))
        }

        // Keep going if the character is not `<` or `>`
        if c != htmlTagStart && c != htmlTagEnd {
            continue
        }

        if c == htmlTagStart {
            // Only update the start if we are not in a tag.
            // This makes sure we strip out `<<br>` not just `<br>`
            if !in {
                start = i

                // Write the valid string between the close and start of the two tags.
                builder.WriteString(string(d[end:start]))
            }
            in = true
            continue
        }
        // else c == htmlTagEnd
        in = false
        end = i + 1
    }
    s = builder.String()
    return s
}
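To see the byte-vs-rune difference that motivates the []rune conversion, a small self-contained check:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    s := "è"
    fmt.Println(len(s))                    // 2: len counts bytes
    fmt.Println(len([]rune(s)))            // 1: one rune
    fmt.Println(utf8.RuneCountInString(s)) // 1: same count without allocating a slice
}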
Upvotes: -1
Reputation: 61
Starting from @Daniel Morell's function, I have created another function with some more possibilities. I am sharing it here in case it is useful for someone:
// CreateCleanWords takes a string and returns a string array with all words in the string.
// Rules:
//   words of length >= minAcceptedLenght
//   everything between < and > is discarded
//   admitted characters: numbers, letters, and all characters in the validRunes map
//   words not present in the wordBlackList map
//   word separators are space or single quote (could be improved with a map of separators)
func CreateCleanWords(s string) []string {
    // Set up a string builder and allocate enough memory for the new string.
    var builder strings.Builder
    builder.Grow(len(s) + utf8.UTFMax)

    insideTag := false // True if we are inside an HTML tag.
    var c rune
    var managed bool = false
    var valid bool = false
    var finalWords []string
    var singleQuote rune = '\''
    var minAcceptedLenght = 4
    var wordBlackList map[string]bool = map[string]bool{
        "sull":  false,
        "sullo": false,
        "sulla": false,
        "sugli": false,
        "sulle": false,
        "alla":  false,
        "all":   false,
        "allo":  false,
        "agli":  false,
        "alle":  false,
        "dell":  false,
        "della": false,
        "dello": false,
        "degli": false,
        "delle": false,
        "dall":  false,
        "dalla": false,
        "dallo": false,
        "dalle": false,
        "dagli": false,
    }
    var validRunes map[rune]bool = map[rune]bool{
        'à': true,
        'è': true,
        'é': true,
        'ì': true,
        'ò': true,
        'ù': true,
        '€': true,
        '$': true,
        '£': true,
        '-': true,
    }

    for _, c = range s {
        managed = false
        valid = false
        //show := string(c)
        //fmt.Println(show)

        // found `<`: from here on, ignore characters
        if !managed && c == htmlTagStart {
            insideTag = true
            managed = true
            valid = false
        }
        // found `>`: characters are valid again
        if !managed && c == htmlTagEnd {
            insideTag = false
            managed = true
            valid = false
        }
        // if we are inside an HTML tag, we don't check anything because we won't take anything
        // until we reach the tag end
        if !insideTag {
            if !managed && (unicode.IsSpace(c) || c == singleQuote) {
                // found a separator: if we have a valid word, add it to the word array
                // (only words of at least minAcceptedLenght characters)
                if builder.Len() >= minAcceptedLenght {
                    word := strings.ToLower(builder.String())
                    // first check that the word is not in the black list
                    if _, ok := wordBlackList[word]; !ok {
                        // the word is not in the blacklist, add it to finalWords
                        finalWords = append(finalWords, word)
                    }
                }
                // make the builder ready for the next token
                builder.Reset()
                valid = false
                managed = true
            }
            // letters and digits are welcome
            if !managed {
                valid = unicode.IsLetter(c) || unicode.IsDigit(c)
                managed = valid
            }
            // other Italian runes accepted
            if !managed {
                _, valid = validRunes[c]
            }
            if valid {
                builder.WriteRune(c)
            }
        }
    }
    // remember to check the last word after exiting the loop!
    if builder.Len() >= minAcceptedLenght {
        // first check that the word is not in the black list
        word := strings.ToLower(builder.String())
        if _, ok := wordBlackList[word]; !ok {
            // the word is not in the blacklist, add it to finalWords
            finalWords = append(finalWords, word)
        }
        builder.Reset()
    }
    return finalWords
}
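A hypothetical usage example (the input sentence is made up; the exact output depends on the rules above):

func main() {
    s := "Questo è un <b>esempio</b> della funzione applicata sulle etichette HTML"
    words := CreateCleanWords(s)
    // Everything between < and > is dropped, words shorter than minAcceptedLenght
    // are skipped, and blacklisted words such as "della" and "sulle" are filtered out.
    fmt.Printf("%v\n", words)
}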
Upvotes: 0
Reputation: 2858
For those who came here looking for a quick solution, there is a library that does this: bluemonday.
Package bluemonday provides a way of describing a whitelist of HTML elements and attributes as a policy, and for that policy to be applied to untrusted strings from users that may contain markup. All elements and attributes not on the whitelist will be stripped.
package main

import (
    "fmt"
    "github.com/microcosm-cc/bluemonday"
)

func main() {
    // Do this once for each unique policy, and use the policy for the life of the program
    // Policy creation/editing is not safe to use in multiple goroutines
    p := bluemonday.StripTagsPolicy()

    // The policy can then be used to sanitize lots of input and it is safe to use the policy in multiple goroutines
    html := p.Sanitize(
        `<a onblur="alert(secret)" href="http://www.google.com">Google</a>`,
    )

    // Output:
    // Google
    fmt.Println(html)
}
https://play.golang.org/p/jYARzNwPToZ
Upvotes: 15
Reputation: 1610
If you want to remove all HTML tags, use a library that strips HTML tags.
A regex to match HTML tags is not a good idea.
package main

import (
    "fmt"
    "github.com/grokify/html-strip-tags-go"
)

func main() {
    text := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"
    stripped := strip.StripTags(text)
    fmt.Println(text)
    fmt.Println(stripped)
}
Upvotes: 9