dmzkrsk

Reputation: 2115

Go encoding transform issue

I have the following code in Go:

import (
    "io/ioutil"
    "log"
    "net/http"

    "code.google.com/p/go.text/encoding/charmap"
    "code.google.com/p/go.text/transform"
)

...

res, err := http.Get(url)
if err != nil {
    log.Println("Cannot read", url)
    log.Println(err)
    continue
}
defer res.Body.Close()

The page I load contains non-UTF-8 characters, so I try to use transform:

utfBody := transform.NewReader(res.Body, charmap.Windows1251.NewDecoder())

The problem is that it returns an error even in this simple scenario:

bytes, err := ioutil.ReadAll(utfBody)
log.Println(err)
if err == nil {
    log.Println(bytes)
}

transform: short destination buffer

ReadAll still fills bytes with some data, but in my real code I use goquery:

doc, err := goquery.NewDocumentFromReader(utfBody)

which sees the error and fails, returning no data.

I tried passing "chunks" of res.Body to transform.NewReader and figured out that it works as long as the chunk contains only valid UTF-8 data. As soon as a chunk contains a non-UTF-8 byte, it fails with the error above.

I'm quite new to Go and don't really understand what's going on or how to deal with it.

Upvotes: 2

Views: 1283

Answers (1)

mat

Reputation: 379

Without the whole code along with an example URL it's hard to tell what exactly is going wrong here.

That said, I can recommend the golang.org/x/net/html/charset package for this, as it supports both charset detection and conversion to UTF-8.

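func fetchUtf8Bytes(url string) ([]byte, error) {
    res, err := http.Get(url)
    if err != nil {
        return nil, err
    }
    defer res.Body.Close() // release the connection once the body is read

    // The Content-Type header is optional, but it improves charset detection.
    contentType := res.Header.Get("Content-Type")
    utf8reader, err := charset.NewReader(res.Body, contentType)
    if err != nil {
        return nil, err
    }

    // utf8reader yields the body converted to UTF-8.
    return ioutil.ReadAll(utf8reader)
}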

Complete example: http://play.golang.org/p/olcBM9ughv
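
To illustrate why the raw body is rejected as UTF-8 and what a decoder has to do, here is a minimal sketch that uses only the standard library. decodeWindows1251 is a hypothetical helper written for this illustration, not part of any package; it handles only ASCII and the Cyrillic letters and replaces every other byte with U+FFFD. It is not a substitute for the charmap or charset packages.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeWindows1251 is a hypothetical helper for illustration only.
// It converts Windows-1251 bytes to a UTF-8 string, covering ASCII and
// the Cyrillic letters; any other byte becomes U+FFFD.
func decodeWindows1251(src []byte) string {
	out := make([]rune, 0, len(src))
	for _, b := range src {
		switch {
		case b < 0x80: // ASCII is identical in both encodings
			out = append(out, rune(b))
		case b >= 0xC0: // 0xC0..0xFF map to U+0410..U+044F
			out = append(out, rune(b)-0xC0+0x0410)
		case b == 0xA8: // capital IO
			out = append(out, 'Ё')
		case b == 0xB8: // small io
			out = append(out, 'ё')
		default: // not covered by this sketch
			out = append(out, utf8.RuneError)
		}
	}
	return string(out)
}

func main() {
	raw := []byte{0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2} // "Привет" in Windows-1251

	fmt.Println(utf8.Valid(raw)) // false: the bytes are not valid UTF-8

	decoded := decodeWindows1251(raw)
	// Each Cyrillic letter takes two bytes in UTF-8, so the output is
	// longer than the input.
	fmt.Println(utf8.ValidString(decoded), len(raw), len(decoded)) // true 6 12
}
```

Note that the decoded output is longer than the input: a single Windows-1251 byte can become a multi-byte UTF-8 sequence, so a decoder needs more destination space than source space.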

Upvotes: 7
