Ezio

Reputation: 753

How to access the tokens of inner tags in Go?

I am making a web scraper and I have never done it before, so please point out if I am doing anything wrong.

I am using Go to scrape.

Suppose I have been given a table:

<table>
   <tr>
         <td>XYZ</td>
         <td>XYZ</td>
         <td>XYZ</td> 
   </tr>
   <tr>
         <td>XYZ</td>
         <td>XYZ</td>
         <td>XYZ</td> 
   </tr>
   <tr>
         <td>XYZ</td>
         <td>XYZ</td>
         <td>XYZ</td> 
   </tr>
   <tr>
         <td>XYZ</td>
         <td>XYZ</td>
         <td>XYZ</td> 
   </tr>
</table>

I want to extract data from each tr, but only from the second td.

Also, can I return a new HTML string that contains only the content inside the table tag, with everything else in the HTML outside the table tag removed?

Upvotes: 2

Views: 1821

Answers (2)

David Smith

Reputation: 892

Another way, using golang.org/x/net/html:

(NB: Speed may be gained by substituting t.DataAtom for t.Data, since comparing integer atoms may be more efficient than comparing strings.)

package main

// https://stackoverflow.com/questions/39947716/how-to-access-inner-tags-token-in-golang

import (
    "fmt"
    "strings"

    "golang.org/x/net/html"
)

func main() {

    r := strings.NewReader(s)

    z := html.NewTokenizer(r)

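    // i counts the <td> cells seen so far in the current row.
    i := 0

    for {
        tt := z.Next()
        switch tt {

        case html.ErrorToken:
            // io.EOF (end of input) or a parse error: stop.
            return

        case html.StartTagToken:
            t := z.Token()

            switch t.Data {

            case "tr":
                // New row: restart the cell count.
                i = 0

            case "td":
                // Second cell (index 1): the next token should be its text.
                // Checking the type avoids printing "td" for an empty cell.
                if i == 1 && z.Next() == html.TextToken {
                    fmt.Println(z.Token().Data)
                }
                i++
            }
        }
    }
}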

var s string = `
<table>
   <tr>
         <td>XYZ</td>
         <td>keep</td>
         <td>XYZ</td> 
   </tr>
   <tr>
         <td>XYZ</td>
         <td>it</td>
         <td>XYZ</td> 
   </tr>
   <tr>
         <td>XYZ</td>
         <td>simple</td>
         <td>XYZ</td> 
   </tr>
   <tr>
         <td>XYZ</td>
         <td>sister</td>
         <td>XYZ</td> 
   </tr>
</table>`

https://play.golang.org/p/tAORrHy8eFJ

Upvotes: 2

Yandry Pozo

Reputation: 5123

Well, first of all, your HTML example is wrong: you missed all the closing tags, </tr> and </td>.

For this kind of job it is always better to use some sort of DOM selector, like jQuery. For Go I recommend goquery; it's a small library and works pretty well. Your solution:

package main

import (
    "log"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    doc, err := goquery.NewDocument("http://your.url.com/foo.html")
    if err != nil {
        log.Fatal(err)
    }

    doc.Find("table tr").Each(func(_ int, tr *goquery.Selection) {

        // for each <tr> found, find the <td>s inside
        // ix is the index
        tr.Find("td").Each(func(ix int, td *goquery.Selection) {

            // print only the td number 2 (index == 1)
            if ix == 1 {
                log.Printf("index: %d content: '%s'", ix, td.Text())
            }
        })
    })
}

As you may note, td.Text() holds the content of each td tag. I have left the full file that I used for testing at https://play.golang.org/p/Rtb1Tqz1Wb

Upvotes: 1
