user3173591

Reputation: 35

Extract text content from HTML in Golang

What's the best way to extract inner substrings from strings in Golang?

input:

"Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"

output:

"this is paragraph \n
 this is paragraph 2"

Is there any string package/library for Go that already does something like this?

package main

import (
    "fmt"
    "strings"
)

func main() {
    longString := "Hello world <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"

    newString := getInnerStrings("<p>", "</p>", longString)

    fmt.Println(newString)
    // output: this is paragraph \n
    //         this is paragraph 2

}
func getInnerStrings(start, end, str string) string {
    // Brain freeze
    // Regex?
    // Bytes loop?
}

Thanks.

Upvotes: 2

Views: 7117

Answers (3)

spicydog

Reputation: 1714

Here is a function that I use a lot.

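// GetInnerSubstring returns the part of str between the first occurrence of
// prefix and the next occurrence of suffix. With a non-empty prefix it returns
// "" when prefix is not found or when no suffix follows it. With an empty
// prefix it returns everything before the first suffix (or all of str if
// suffix is missing or empty). An empty suffix means "to the end of str".
// Requires the standard "strings" package.
func GetInnerSubstring(str string, prefix string, suffix string) string {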
    var beginIndex, endIndex int
    beginIndex = strings.Index(str, prefix)
    if beginIndex == -1 {
        beginIndex = 0
        endIndex = 0
    } else if len(prefix) == 0 {
        beginIndex = 0
        endIndex = strings.Index(str, suffix)
        if endIndex == -1 || len(suffix) == 0 {
            endIndex = len(str)
        }
    } else {
        beginIndex += len(prefix)
        endIndex = strings.Index(str[beginIndex:], suffix)
        if endIndex == -1 {
            if strings.Index(str, suffix) < beginIndex {
                endIndex = beginIndex
            } else {
                endIndex = len(str)
            }
        } else {
            if len(suffix) == 0 {
                endIndex = len(str)
            } else {
                endIndex += beginIndex
            }
        }
    }

    return str[beginIndex:endIndex]
}

You can try it on the playground: https://play.golang.org/p/Xo0SJu0Vq4.
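GetInnerSubstring returns only the first match, while the question asks for every paragraph. A minimal sketch of one way to collect all matches with repeated strings.Index calls follows; getAllInnerStrings is a hypothetical helper name, not part of the answer above, and trimming the surrounding whitespace is an assumption based on the output shown in the question.

```go
package main

import (
	"fmt"
	"strings"
)

// getAllInnerStrings (hypothetical helper) returns the trimmed text found
// between every start/end pair in str, in order of appearance.
func getAllInnerStrings(start, end, str string) []string {
	var parts []string
	if start == "" || end == "" {
		return parts // empty delimiters would match everywhere
	}
	for {
		i := strings.Index(str, start)
		if i == -1 {
			break // no further opening delimiter
		}
		str = str[i+len(start):] // skip past the opening delimiter
		j := strings.Index(str, end)
		if j == -1 {
			break // opening delimiter without a closing one
		}
		parts = append(parts, strings.TrimSpace(str[:j]))
		str = str[j+len(end):] // continue after the closing delimiter
	}
	return parts
}

func main() {
	longString := "Hello world <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"
	fmt.Println(strings.Join(getAllInnerStrings("<p>", "</p>", longString), "\n"))
}
```

This treats the tags as plain delimiters, so it is only suitable for simple, well-formed input like the string in the question; it does not handle attributes or nested tags.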

Upvotes: 1

Ali Altun

Reputation: 407

StrExtract retrieves the string between two delimiters.

StrExtract(sExper, sAdelim, sCdelim, nOccur)

sExper: Specifies the expression to search.

sAdelim: Specifies the delimiter that marks the beginning of the string to extract.

sCdelim: Specifies the delimiter that marks the end of the string to extract.

nOccur: Specifies at which occurrence of sAdelim in sExper to start the extraction (1 is the first occurrence).


package main

import (
    "fmt"
    "strings"
)

func main() {
    s := "a11ba22ba333ba4444ba55555ba666666b"
    fmt.Println("StrExtract1: ", StrExtract(s, "a", "b", 5))
}

func StrExtract(sExper, sAdelim, sCdelim string, nOccur int) string {

    aExper := strings.Split(sExper, sAdelim)

    if len(aExper) <= nOccur {
        return ""
    }

    sMember := aExper[nOccur]
    aExper = strings.Split(sMember, sCdelim)

    if len(aExper) == 1 {
        return ""
    }

    return aExper[0]
}
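As a rough illustration of how this applies to the string from the question, the sketch below repeats StrExtract as written above and calls it with "<p>" and "</p>" as delimiters. The %q verb is used so the untrimmed spaces are visible. Note that a negative nOccur would index out of range and panic.

```go
package main

import (
	"fmt"
	"strings"
)

// StrExtract is copied from the answer above so that the example is
// self-contained.
func StrExtract(sExper, sAdelim, sCdelim string, nOccur int) string {
	aExper := strings.Split(sExper, sAdelim)
	if len(aExper) <= nOccur {
		return "" // fewer than nOccur opening delimiters
	}
	sMember := aExper[nOccur]
	aExper = strings.Split(sMember, sCdelim)
	if len(aExper) == 1 {
		return "" // no closing delimiter after this occurrence
	}
	return aExper[0]
}

func main() {
	s := "Hello world <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"
	fmt.Printf("%q\n", StrExtract(s, "<p>", "</p>", 1)) // " this is paragraph "
	fmt.Printf("%q\n", StrExtract(s, "<p>", "</p>", 2)) // " this is paragraph 2 "
	fmt.Printf("%q\n", StrExtract(s, "<p>", "</p>", 3)) // "" (there is no third <p>)
}
```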

Upvotes: 0

thwd

Reputation: 24898

Don't use regular expressions to try and interpret HTML. Use a fully capable HTML tokenizer and parser.

I recommend you read this article on CodingHorror.
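To make the tokenizer idea concrete, here is a minimal sketch. The dedicated HTML tokenizer and parser for Go live in the golang.org/x/net/html package; to keep this example runnable with only the standard library, it swaps in encoding/xml with the lenient settings its documentation suggests for HTML-like input. That is an assumption of convenience and not a full HTML parser. textInside is a hypothetical helper name.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// textInside (hypothetical helper) returns the trimmed text runs found inside
// every <tag> element of src, in document order. Text split by nested
// elements is returned as separate entries.
func textInside(src, tag string) ([]string, error) {
	d := xml.NewDecoder(strings.NewReader(src))
	// Lenient settings described in the encoding/xml docs for common HTML.
	d.Strict = false
	d.AutoClose = xml.HTMLAutoClose
	d.Entity = xml.HTMLEntity

	var out []string
	depth := 0 // number of <tag> elements currently open
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if strings.EqualFold(t.Name.Local, tag) {
				depth++
			}
		case xml.EndElement:
			if strings.EqualFold(t.Name.Local, tag) && depth > 0 {
				depth--
			}
		case xml.CharData:
			// Keep text only while at least one <tag> is open.
			if text := strings.TrimSpace(string(t)); depth > 0 && text != "" {
				out = append(out, text)
			}
		}
	}
}

func main() {
	input := "Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"
	parts, err := textInside(input, "p")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(strings.Join(parts, "\n"))
}
```

Because the input is tokenized instead of pattern-matched, tags with attributes such as <p class="x"> are still recognized as p elements. For real-world HTML, the same loop structure carries over to a proper HTML tokenizer.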

Upvotes: 6
