Reputation: 753
I am writing a web scraper for the first time, so please point out anything I am doing wrong.
I am using Go (golang) to scrape.
Suppose I have been given a table:
<table>
<tr>
<td>XYZ</td>
<td>XYZ</td>
<td>XYZ</td>
</tr>
<tr>
<td>XYZ</td>
<td>XYZ</td>
<td>XYZ</td>
</tr>
<tr>
<td>XYZ</td>
<td>XYZ</td>
<td>XYZ</td>
</tr>
<tr>
<td>XYZ</td>
<td>XYZ</td>
<td>XYZ</td>
</tr>
</table>
I want to extract data from each tr, but only from the second td.
Also, can I return a new HTML string containing only the content inside the table tag, with everything else in the HTML outside the table tag removed?
Upvotes: 2
Views: 1821
Reputation: 892
Another way, using golang.org/x/net/html.
(NB: Matching on t.DataAtom instead of t.Data may be faster, since it compares integers rather than strings.)
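package main

// https://stackoverflow.com/questions/39947716/how-to-access-inner-tags-token-in-golang

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	r := strings.NewReader(s)
	z := html.NewTokenizer(r)
	i := 0 // index of the current <td> within its <tr>
	for {
		tt := z.Next()
		switch tt {
		case html.ErrorToken:
			// end of input (io.EOF) or a tokenizer error: stop either way
			return
		case html.StartTagToken:
			t := z.Token()
			switch t.Data {
			case "tr":
				i = 0 // new row: restart the cell count
			case "td":
				if i == 1 {
					// second cell: print the token that follows only if it is
					// text, so an empty cell does not print its closing tag name
					if z.Next() == html.TextToken {
						fmt.Println(z.Token().Data)
					}
				}
				i++
			}
		}
	}
}
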
var s string = `
<table>
<tr>
<td>XYZ</td>
<td>keep</td>
<td>XYZ</td>
</tr>
<tr>
<td>XYZ</td>
<td>it</td>
<td>XYZ</td>
</tr>
<tr>
<td>XYZ</td>
<td>simple</td>
<td>XYZ</td>
</tr>
<tr>
<td>XYZ</td>
<td>sister</td>
<td>XYZ</td>
</tr>
</table>`
https://play.golang.org/p/tAORrHy8eFJ
Upvotes: 2
Reputation: 5123
Well, first of all, your HTML example is wrong: you missed all the closing tags, </tr> and </td>.
For this kind of job it is usually better to use some sort of DOM selectors, like jQuery does. For Go I recommend goquery; it's a small library and works pretty well. Your solution:
package main

import (
	"log"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// Note: NewDocument is deprecated in newer goquery releases, which suggest
	// fetching with net/http and calling goquery.NewDocumentFromReader instead.
	doc, err := goquery.NewDocument("http://your.url.com/foo.html")
	if err != nil {
		log.Fatal(err)
	}
	doc.Find("table tr").Each(func(_ int, tr *goquery.Selection) {
		// for each <tr> found, find the <td>s inside
		// ix is the index of the <td> within its row
		tr.Find("td").Each(func(ix int, td *goquery.Selection) {
			// print only the second td (index == 1)
			if ix == 1 {
				log.Printf("index: %d content: '%s'", ix, td.Text())
			}
		})
	})
}
As you may note, td.Text() holds the content of each td tag. Here is the full file I used for testing: https://play.golang.org/p/Rtb1Tqz1Wb
Upvotes: 1