user3443581

Reputation: 63

Go - Getting the text of a single particular HTML element from a document with a known structure

In a little script I'm writing, I make a POST to a web service and receive an HTML document in response. This document is largely irrelevant to my needs, with the exception of the contents of a single textarea. This textarea is the only textarea in the page and it has a particular name that I know ahead of time. I want to grab that text without worrying about anything else in the document. Currently I'm using regex to get the correct line and then to delete the tags, but I feel like there's probably a better way.

Here's what the document looks like:

<html><body>
<form name="query" action="http://www.example.net/action.php" method="post">
    <textarea type="text" name="nameiknow"/>The text I want</textarea>
    <div id="button">
        <input type="submit" value="Submit" />
    </div>
</form>
</body></html>

And here's how I'm currently getting the text:

s := string(body)

// Gets the line I want
r := regexp.MustCompile(`<textarea.*name=("|')nameiknow("|').*textarea>`)
s = r.FindString(s)

// Deletes the tags
r = regexp.MustCompile(`<[^>]*>`)
s = r.ReplaceAllString(s, "")

I think using a full HTML parser might be a bit too much in this case, which is why I went in this direction, though for all I know there's something much better out there.

I appreciate any advice you may have.

Upvotes: 1

Views: 2367

Answers (2)

Kluyg

Reputation: 5347

Take a look at the goquery package: https://github.com/PuerkitoBio/goquery. It provides a jQuery-like API for Go, so you can write things like:

text := doc.Find("strong").Text()

Full working example:

package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

var s = `<html><body>
<form name="query" action="http://www.example.net/action.php" method="post">
    <textarea type="text" name="nameiknow">The text I want</textarea>
    <div id="button">
        <input type="submit" value="Submit" />
    </div>
</form>
</body></html>`

func main() {
    // Parse the HTML. Any io.Reader works here, including an HTTP response body.
    doc, err := goquery.NewDocumentFromReader(strings.NewReader(s))
    if err != nil {
        log.Fatal(err)
    }
    // Select the textarea by its known name attribute and read its text.
    text := doc.Find(`textarea[name="nameiknow"]`).Text()
    fmt.Println(text)
}

Prints: "The text I want".

Upvotes: 4

Sabuj Hassan

Reputation: 39375

Parsing HTML with regex is not best practice, but since you asked for it, here it is:

(<textarea\b[^>]*\bname\s*=\s*(?:\"|')\s*nameiknow\s*(?:\"|')[^<]*<\/textarea>)
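As written, that pattern matches the whole element, tags included. A minimal sketch of how it might be used from Go follows, assuming a capture group is added around the content so the tags never need stripping. The helper name extractTextarea is hypothetical, and the (?is) flags, the extra [^>]*> before the content, and the html.UnescapeString call are my additions rather than part of the pattern above.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
)

// extractTextarea returns the text inside the first <textarea> whose name
// attribute equals name. The helper and its name are illustrative only.
func extractTextarea(doc, name string) (string, bool) {
	// (?is): match case-insensitively and let . span newlines, so multi-line
	// textarea content is captured. QuoteMeta keeps special characters in
	// name from being interpreted as regex syntax.
	pattern := `(?is)<textarea\b[^>]*\bname\s*=\s*["']\s*` + regexp.QuoteMeta(name) +
		`\s*["'][^>]*>(.*?)</textarea>`
	re := regexp.MustCompile(pattern)

	m := re.FindStringSubmatch(doc)
	if m == nil {
		return "", false
	}
	// m[1] is the captured content; decode entities such as &amp;.
	return html.UnescapeString(m[1]), true
}

func main() {
	// Same textarea line as in the question, including the stray "/>".
	doc := `<form name="query">
    <textarea type="text" name="nameiknow"/>The text I want</textarea>
</form>`

	if text, ok := extractTextarea(doc, "nameiknow"); ok {
		fmt.Println(text)
	}
}
```

This stays within the standard library, which suits a small script, but it is still pattern matching rather than parsing: it assumes the name attribute is quoted and that the content contains no nested "</textarea>" text.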

Upvotes: 2
