How to access inner tags token in golang?

Question

I am writing a web scraper and have never done this before, so please point out anything I am doing wrong.

I am using Go (golang) for the scraping.

Suppose I have been given a table.

I want to extract data from each tr, but only from the second td.

Also, can I return a new HTML string that contains only the content inside the table tag, with everything else in the HTML outside the table tag removed?

Yandry Pozo · Accepted Answer

Well, first of all, your HTML example is wrong: you left out all the closing tags, </tr> and </td>.

For this kind of job it is always better to use some sort of DOM selectors, like jQuery's. For Go I recommend goquery; it's a small library and works pretty well. Your solution:

package main

import (
    "log"

    "github.com/PuerkitoBio/goquery"
)

func main() {
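    // Note: NewDocument is deprecated in newer goquery releases in favor of
    // fetching with net/http and calling goquery.NewDocumentFromReader.
    doc, err := goquery.NewDocument("http://your.url.com/foo.html")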
    if err != nil {
        log.Fatal(err)
    }

    doc.Find("table tr").Each(func(_ int, tr *goquery.Selection) {

        // for each <tr> found, find the <td>s inside
        // ix is the index of the <td> within its <tr>
        tr.Find("td").Each(func(ix int, td *goquery.Selection) {

            // print only the td number 2 (index == 1)
            if ix == 1 {
                log.Printf("index: %d content: '%s'", ix, td.Text())
            }
        })
    })
}

As you may note, td.Text() holds the content of each td tag. I left the full file that I used for testing at https://play.golang.org/p/Rtb1Tqz1Wb
