How to extract only text from HTML in Golang?

Question

To extract text from HTML, I use the fully HTML5-compliant tokenizer and parser from golang.org/x/net/html, like this:

    s := `
Links:
Foo

BarBaz
TEXT I WANT
`

    domDocTest := html.NewTokenizer(strings.NewReader(s))
    for tokenType := domDocTest.Next(); tokenType != html.ErrorToken; {
        if tokenType != html.TextToken {
            tokenType = domDocTest.Next()
            continue
        }
        TxtContent := strings.TrimSpace(html.UnescapeString(string(domDocTest.Text())))
        if len(TxtContent) > 0 {
            fmt.Printf("%s\n", TxtContent)
        }
        tokenType = domDocTest.Next()
    }

but I get this result:

Links:
Foo
BarBaz
TEXT
I
WANT
/*  */

I don't want the CDATA content. Any ideas on how to get only the text content?

LeMoussel · Accepted Answer

As suggested by @Eric Pauley, I look at both text tokens and start-tag tokens, and skip any text whose most recent start tag is script. Here is my solution:

    s := `
Links:
Foo

BarBaz
TEXT I WANT
`

    domDocTest := html.NewTokenizer(strings.NewReader(s))
    // Nothing has been read yet, so this is only an empty placeholder token.
    previousStartTokenTest := domDocTest.Token()
loopDomTest:
    for {
        tt := domDocTest.Next()
        switch {
        case tt == html.ErrorToken:
            break loopDomTest // End of the document,  done
        case tt == html.StartTagToken:
            previousStartTokenTest = domDocTest.Token()
        case tt == html.TextToken:
            if previousStartTokenTest.Data == "script" {
                continue
            }
            TxtContent := strings.TrimSpace(html.UnescapeString(string(domDocTest.Text())))
            if len(TxtContent) > 0 {
                fmt.Printf("%s\n", TxtContent)
            }
        }
    }
