Reputation: 2115
I have a following code in go:
import (
"log"
"net/http"
"code.google.com/p/go.text/transform"
"code.google.com/p/go.text/encoding/charmap"
)
...
res, err := http.Get(url)
if err != nil {
log.Println("Cannot read", url);
log.Println(err);
continue
}
defer res.Body.Close()
The page I load contain non UTF-8 symbols. So I try to use transform
utfBody := transform.NewReader(res.Body, charmap.Windows1251.NewDecoder())
But the problem is, that it returns error even in this simple scenarion:
bytes, err := ioutil.ReadAll(utfBody)
log.Println(err)
if err == nil {
log.Println(bytes)
}
transform: short destination buffer
It also actually sets bytes
with some data, but in my real code I use goquery
:
doc, err := goquery.NewDocumentFromReader(utfBody)
Which sees an error and fails with not data in return
I tried to pass "chunks" of res.Body
to transform.NewReader
and figuried out, that as long as res.Body contains no non-UTF8 data it works well. And when it contains non-UTF8 byte it fails with an error above.
I'm quite new to go and don't really understand what's going on and how to deal with this
Upvotes: 2
Views: 1283
Reputation: 379
Without the whole code along with an example URL it's hard to tell what exactly is going wrong here.
That said, I can recommend the golang.org/x/net/html/charset
package for this as it supports both char guessing and converting to UTF 8.
func fetchUtf8Bytes(url string) ([]byte, error) {
res, err := http.Get(url)
if err != nil {
return nil, err
}
contentType := res.Header.Get("Content-Type") // Optional, better guessing
utf8reader, err := charset.NewReader(res.Body, contentType)
if err != nil {
return nil, err
}
return ioutil.ReadAll(utf8reader)
}
Complete example: http://play.golang.org/p/olcBM9ughv
Upvotes: 7