Loint

Reputation: 3948

How to replace all HTML tags with an empty string in Go

I'm trying to replace all HTML tags such as <div> and </div> with an empty string ("") in Go, using the regex pattern ^[^.\/]*$/g to match all closing tags, e.g. </div>.

My solution:

package main

import (
    "fmt"
    "regexp"
)

const Template = `^[^.\/]*$/g`

func main() {
    r := regexp.MustCompile(Template)
    s := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"

    res := r.ReplaceAllString(s, "")
    fmt.Println(res)
}

But the output is the same as the source string. What's wrong? Please help. Thanks.

Expected result: "afsdf4534534!@@!!#345345afsdf4534534!@@!!#"

Upvotes: 8

Views: 19879

Answers (6)

Daniel Morell

Reputation: 2586

The Problem with RegEx

This is a very simple RegEx replace method that removes HTML tags from well-formatted HTML in a string.

strip_html_regex.go

package main

import "regexp"

const regex = `<.*?>`

// This method uses a regular expression to remove HTML tags.
func stripHtmlRegex(s string) string {
    r := regexp.MustCompile(regex)
    return r.ReplaceAllString(s, "")
}

Note: this does not work well with malformed HTML. Don't use this.
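For reference, here is a minimal sketch of why the pattern in the question never matches, alongside a regex that does handle the well-formed case. Go's regexp package takes a bare pattern with no `/.../g` delimiters or flags, so the trailing `/g` is treated as literal text that would have to appear after the end-of-text anchor `$`. The helper name `stripTagsRegex` and the pattern `<[^>]*>` are my own choices for illustration, and the same caveat about malformed HTML applies.

```go
package main

import (
	"fmt"
	"regexp"
)

// tagPattern matches "<", then any run of characters other than ">", then ">".
// Compiling once at package level avoids recompiling on every call.
var tagPattern = regexp.MustCompile(`<[^>]*>`)

// stripTagsRegex is a hypothetical helper name used only for this sketch.
func stripTagsRegex(s string) string {
	return tagPattern.ReplaceAllString(s, "")
}

func main() {
	s := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"

	// The question's pattern compiles, but "/g" is matched literally after
	// the end-of-text anchor "$", so it can never match anything.
	broken := regexp.MustCompile(`^[^.\/]*$/g`)
	fmt.Println(broken.MatchString(s)) // false

	fmt.Println(stripTagsRegex(s)) // afsdf4534534!@@!!#345345afsdf4534534!@@!!#
}
```

ReplaceAllString already replaces every match, so no "global" flag is needed.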

A better way

Since a string in Go can be treated as a slice of bytes, it is easy to walk through the string and find the portions that are not inside an HTML tag. When we identify a valid portion of the string, we can simply take a slice of that portion and append it using a strings.Builder.

strip_html.go

package main

import (
    "strings"
    "unicode/utf8"
)

const (
    htmlTagStart = 60 // Unicode `<`
    htmlTagEnd   = 62 // Unicode `>`
)

// Aggressively strips HTML tags from a string.
// It will only keep anything between `>` and `<`.
func stripHtmlTags(s string) string {
    // Setup a string builder and allocate enough memory for the new string.
    var builder strings.Builder
    builder.Grow(len(s) + utf8.UTFMax)

    in := false // True if we are inside an HTML tag.
    start := 0  // The index of the previous start tag character `<`
    end := 0    // The index of the previous end tag character `>`

    for i, c := range s {
        // If this is the last character and we are not in an HTML tag, save it.
        if (i+1) == len(s) && end >= start {
            builder.WriteString(s[end:])
        }

        // Keep going if the character is not `<` or `>`
        if c != htmlTagStart && c != htmlTagEnd {
            continue
        }

        if c == htmlTagStart {
            // Only update the start if we are not in a tag.
            // This makes sure we strip out `<<br>`, not just `<br>`
            if !in {
                start = i

                // Write the valid string between the close and start of the two tags.
                builder.WriteString(s[end:start])
            }
            in = true
            continue
        }
        // else c == htmlTagEnd
        in = false
        end = i + 1
    }
    s = builder.String()
    return s
}

If we run these two functions with the OP's text and some malformed HTML, you will see that the results are not consistent.

main.go

package main

import "fmt"

func main() {
    s := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"

    res := stripHtmlTags(s)
    fmt.Println(res)

    // Malformed HTML examples
    fmt.Println("\n:: stripHTMLTags ::\n")

    fmt.Println(stripHtmlTags("Do something <strong>bold</strong>."))
    fmt.Println(stripHtmlTags("h1>I broke this</h1>"))
    fmt.Println(stripHtmlTags("This is <a href='#'>>broken link</a>."))
    fmt.Println(stripHtmlTags("I don't know ><where to <<em>start</em> this tag<."))
    
    // Regex Malformed HTML examples
    fmt.Println(":: stripHtmlRegex ::\n")

    fmt.Println(stripHtmlRegex("Do something <strong>bold</strong>."))
    fmt.Println(stripHtmlRegex("h1>I broke this</h1>"))
    fmt.Println(stripHtmlRegex("This is <a href='#'>>broken link</a>."))
    fmt.Println(stripHtmlRegex("I don't know ><where to <<em>start</em> this tag<."))
}

Output:

afsdf4534534!@@!!#345345afsdf4534534!@@!!#

:: stripHTMLTags ::

Do something bold.
I broke this
This is broken link.
start this tag

:: stripHtmlRegex ::

Do something bold.
h1>I broke this
This is >broken link.
I don't know >start this tag<.

Note that the RegEx method does not remove all HTML tags consistently. To be honest, I am not good enough at RegEx to write a pattern that properly handles stripping HTML.

Benchmarks

Aside from the advantage of being safer and more aggressive in stripping malformed HTML tags, stripHtmlTags is about 4 times faster than stripHtmlRegex.

> go test -run=Calculate -bench=.
goos: windows
goarch: amd64
BenchmarkStripHtmlRegex-8          51516             22726 ns/op
BenchmarkStripHtmlTags-8          230678              5135 ns/op
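The benchmark functions themselves are not shown here. One factor that may be worth isolating is that stripHtmlRegex calls regexp.MustCompile on every invocation, so some of its measured cost is pattern compilation rather than matching. The sketch below is only an assumption about how one might check that; the function names are hypothetical, and testing.Benchmark is used so that it runs as an ordinary program. No timings are claimed, since they depend on the machine.

```go
package main

import (
	"fmt"
	"regexp"
	"testing"
)

const input = "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"

// stripCompileEachCall mirrors stripHtmlRegex: the pattern is compiled on every call.
func stripCompileEachCall(s string) string {
	return regexp.MustCompile(`<.*?>`).ReplaceAllString(s, "")
}

// precompiled is built once, when the program starts.
var precompiled = regexp.MustCompile(`<.*?>`)

// stripPrecompiled reuses the already compiled pattern.
func stripPrecompiled(s string) string {
	return precompiled.ReplaceAllString(s, "")
}

func main() {
	each := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			stripCompileEachCall(input)
		}
	})
	once := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			stripPrecompiled(input)
		}
	})
	fmt.Println("compile on each call:", each)
	fmt.Println("compile once:        ", once)
}
```

Both helpers return the same string; only the place where the pattern is compiled differs.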

Upvotes: 13

Adrien Parrochia

Reputation: 824

I found this code:

func RemoveHtmlTag(in string) string {
    // regex to match html tag
    const pattern = `(<\/?[a-zA-Z]+?[^>]*\/?>)*`
    r := regexp.MustCompile(pattern)
    groups := r.FindAllString(in, -1)
    // should replace long string first
    sort.Slice(groups, func(i, j int) bool {
        return len(groups[i]) > len(groups[j])
    })
    for _, group := range groups {
        if strings.TrimSpace(group) != "" {
            in = strings.ReplaceAll(in, group, "")
        }
    }

    // remove any stray angle brackets left behind
    in = strings.ReplaceAll(in, ">", "")
    return strings.ReplaceAll(in, "<", "")
}

Code source : https://gist.github.com/g10guang/04f11221dadf1ed019e0d3cf3e82caf3

fmt.Println(RemoveHtmlTag("<b>>text with <where <>I don't know ><where to <<em>start</em> this tag<. tags<</b>"))

returns:

text with I don't know start this tag. tags

I ran a benchmark comparing this method with a previous method on this page:


func stripHtmlRegex(input string) string {
    input = regexp.MustCompile(`<.*?>`).ReplaceAllString(input, "")
    input = strings.ReplaceAll(input, ">", "")
    return strings.ReplaceAll(input, "<", "")
    // return strings.NewReplacer(">", "", "<", "").Replace(input) // slow
}

and the bluemonday library:

func Sanitize(s string) string {
    p := bluemonday.StripTagsPolicy()
    return p.Sanitize(s)
}

Result:

BenchmarkStripHtmlRegex-12        537960          2076 ns/op
BenchmarkRemoveHtmlTag-12         102960         10326 ns/op
BenchmarkSanitize-12              365942          3079 ns/op

Conclusion:

stripHtmlRegex is the fastest, followed by the bluemonday library; RemoveHtmlTag is roughly five times slower than stripHtmlRegex.

Test code: https://go.dev/play/p/7fkN9Vp-86U

Upvotes: 0

leogoesger

Reputation: 3840

We have tried these in production, and in certain corner cases none of the proposed solutions really work. If you need something robust, check out the unexported stripTags function in the Go standard library's html/template package (the html-strip-tags-go package is basically an export of it, under a BSD-3 license), or https://github.com/microcosm-cc/bluemonday, a popular library (also BSD-3) that we ended up using.

=================================================

An improvement on @Daniel Morell's answer. The only difference concerns how len evaluates a string that contains multi-byte UTF-8 characters: len counts bytes, and each character takes between 1 and 4 bytes, so len("è") actually evaluates to 2. To fix that, we convert the string to a slice of runes.

https://go.dev/play/p/xo7Mrx5qw-_J

// Aggressively strips HTML tags from a string.
// It will only keep anything between `>` and `<`.
func stripHTMLTags(s string) string {
    // Supports utf-8, since some char could take more than 1 byte. ie: len("è") -> 2
    // All indices below are rune indices into d, never byte offsets into s.
    d := []rune(s)
    // Setup a string builder and allocate enough memory for the new string.
    var builder strings.Builder
    builder.Grow(len(s) + utf8.UTFMax)

    in := false // True if we are inside an HTML tag.
    start := 0  // The index of the previous start tag character `<`
    end := 0    // The index of the previous end tag character `>`

    for i, c := range d {
        // If this is the last character and we are not in an HTML tag, save it.
        if (i+1) == len(d) && end >= start {
            builder.WriteString(string(d[end:]))
        }

        // Keep going if the character is not `<` or `>`
        if c != htmlTagStart && c != htmlTagEnd {
            continue
        }

        if c == htmlTagStart {
            // Only update the start if we are not in a tag.
            // This makes sure we strip out `<<br>`, not just `<br>`
            if !in {
                start = i

                // Write the valid string between the close and start of the two tags.
                // Doing this only once per tag avoids writing the same text twice on `<<`.
                builder.WriteString(string(d[end:start]))
            }
            in = true
            continue
        }
        // else c == htmlTagEnd
        in = false
        end = i + 1
    }
    s = builder.String()
    return s
}
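The byte-versus-rune distinction this answer relies on can be checked directly. The sketch below is only an illustration; `runeStarts` is a hypothetical helper name, and the string uses the escape `\u00e8` so that "è" is guaranteed to be the single precomposed code point.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeStarts returns the byte offset at which each rune of s begins.
// Ranging over a string yields byte offsets, not rune counts.
func runeStarts(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "\u00e8<b>" // "è<b>"

	fmt.Println(len(s))                    // 5: "è" takes 2 bytes in UTF-8
	fmt.Println(utf8.RuneCountInString(s)) // 4
	fmt.Println(len([]rune(s)))            // 4
	fmt.Println(runeStarts(s))             // [0 2 3 4]
}
```

This is why an index taken from a []rune must be used to slice the []rune, and an index taken from ranging over the string must be used to slice the string.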

Upvotes: -1

Alfonso Moscato

Reputation: 61

Starting from @Daniel Morell's function, I have created another function with a few more features. I am sharing it here in case it is useful to someone:

// CreateCleanWords takes a string and returns a slice with all the words in it.
// Rules:
// - words must have a length >= minAcceptedLenght
// - everything between < and > is discarded
// - admitted characters: digits, letters, and all characters in the validRunes map
// - words present in the wordBlackList map are dropped
// - word separators are whitespace or a single quote (could be improved with a map of separators)
func CreateCleanWords(s string) []string {
    // Setup a string builder and allocate enough memory for the new string.
    var builder strings.Builder
    builder.Grow(len(s) + utf8.UTFMax)

    insideTag := false // True if we are inside an HTML tag.
    var c rune
    var managed bool = false
    var valid bool = false
    var finalWords []string
    var singleQuote rune = '\''
    var minAcceptedLenght = 4

    var wordBlackList map[string]bool = map[string]bool{
        "sull":  false,
        "sullo": false,
        "sulla": false,
        "sugli": false,
        "sulle": false,
        "alla":  false,
        "all":   false,
        "allo":  false,
        "agli":  false,
        "alle":  false,
        "dell":  false,
        "della": false,
        "dello": false,
        "degli": false,
        "delle": false,
        "dall":  false,
        "dalla": false,
        "dallo": false,
        "dalle": false,
        "dagli": false,
    }

    var validRunes map[rune]bool = map[rune]bool{
        'à': true,
        'è': true,
        'é': true,
        'ì': true,
        'ò': true,
        'ù': true,
        '€': true,
        '$': true,
        '£': true,
        '-': true,
    }

    for _, c = range s {

        managed = false
        valid = false

        //show := string(c)
        //fmt.Println(show)

        // found < from here on ignore characters
        if !managed && c == htmlTagStart {
            insideTag = true
            managed = true
            valid = false
        }
        // found > characters are valid now
        if !managed && c == htmlTagEnd {
            insideTag = false
            managed = true
            valid = false
        }

        // if we are inside an HTML tag, we don't check anything because we won't take anything
        // until we reach the tag end
        if !insideTag {
            if !managed && unicode.IsSpace(c) || c == singleQuote {
                // found a separator: if we have a valid word, add it to the word array,
                // but only if it reaches the minimum accepted length
                if builder.Len() >= minAcceptedLenght {
                    word := strings.ToLower((builder).String())
                    //first check if the word is not in a black list
                    if _, ok := wordBlackList[word]; !ok {
                        // the word is not in blacklist let's add to finalWords
                        finalWords = append(finalWords, word)
                    }

                }
                // make builder ready for next token
                builder.Reset()

                valid = false
                managed = true
            }

            // letters and digits are welcome
            if !managed {
                valid = unicode.IsLetter(c) || unicode.IsDigit(c)
                managed = valid
            }

            // other italian runes accepted
            if !managed {
                _, valid = validRunes[c]
            }

            if valid {
                builder.WriteRune(c)
            }
        }
    }
    // remember to check the last word after exiting from for!
    // same rules as inside the loop: minimum length, lower case, black list
    if builder.Len() >= minAcceptedLenght {
        word := strings.ToLower(builder.String())
        // first check if the word is not in a black list
        if _, ok := wordBlackList[word]; !ok {
            // the word is not in blacklist let's add to finalWords
            finalWords = append(finalWords, word)
        }
        builder.Reset()

    }

    return finalWords
}
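The word-splitting rules above (whitespace or single quote as separators, a minimum length, and a black list) can also be sketched with strings.FieldsFunc. This is a simplified alternative and not the function above: it assumes the tags have already been removed, it does not filter characters against validRunes, and it counts runes rather than bytes for the minimum length. The name `splitWords` and its parameters are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// splitWords splits tag-free text on whitespace and single quotes, lowercases
// each word, and drops words shorter than minLen runes or present in stop.
func splitWords(text string, minLen int, stop map[string]bool) []string {
	fields := strings.FieldsFunc(text, func(r rune) bool {
		return unicode.IsSpace(r) || r == '\''
	})

	var words []string
	for _, f := range fields {
		w := strings.ToLower(f)
		if utf8.RuneCountInString(w) < minLen || stop[w] {
			continue
		}
		words = append(words, w)
	}
	return words
}

func main() {
	stop := map[string]bool{"dell": true, "sulla": true}
	fmt.Println(splitWords("Il gatto dell'amico sulla sedia", 4, stop))
	// [gatto amico sedia]
}
```

Storing true in the stop map lets the lookup `stop[w]` be used directly, since a missing key yields false.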


Upvotes: 0

Bill Zelenko

Reputation: 2858

For those who came here looking for a quick solution, there is a library that does this: bluemonday.

Package bluemonday provides a way of describing a whitelist of HTML elements and attributes as a policy, and for that policy to be applied to untrusted strings from users that may contain markup. All elements and attributes not on the whitelist will be stripped.

package main

import (
    "fmt"

    "github.com/microcosm-cc/bluemonday"
)

func main() {
    // Do this once for each unique policy, and use the policy for the life of the program
    // Policy creation/editing is not safe to use in multiple goroutines
    p := bluemonday.StripTagsPolicy()

    // The policy can then be used to sanitize lots of input and it is safe to use the policy in multiple goroutines
    html := p.Sanitize(
        `<a onblur="alert(secret)" href="http://www.google.com">Google</a>`,
    )

    // Output:
    // Google
    fmt.Println(html)
}

https://play.golang.org/p/jYARzNwPToZ

Upvotes: 15

sh.seo

Reputation: 1610

If you want to remove all HTML tags, use a package that strips HTML tags.

Using a regex to match HTML tags is not a good idea.

package main

import (
    "fmt"
    strip "github.com/grokify/html-strip-tags-go"
)

func main() {
    text := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"

    stripped := strip.StripTags(text)

    fmt.Println(text)
    fmt.Println(stripped)
}

Upvotes: 9
