JakubKubera

Reputation: 436

Convert any encoding to UTF-8 in Go

I'm downloading messages via IMAP, parsing them, and storing the parsed messages in MongoDB. The problem is that MongoDB supports only UTF-8, while the messages arrive in various encodings. How can I convert each string to UTF-8?

I know I could store the data as binary, but I need plain text because I have to search for phrases in the database. Or is it possible to search for plain text inside binary data?

Upvotes: 11

Views: 26962

Answers (4)

a-h

Reputation: 4274

In 2020 I found that https://pkg.go.dev/mod/golang.org/x/text worked well for me.

package main

import (
    "bytes"
    "fmt"
    "io/ioutil"
    "log"

    "golang.org/x/text/encoding/ianaindex"
    "golang.org/x/text/transform"
)

func main() {
    text := "\xa35 for Pepp\xe9"
    charset := "latin1"
    e, err := ianaindex.MIME.Encoding(charset)
    if err != nil {
        log.Fatal(err)
    }
    r := transform.NewReader(bytes.NewBufferString(text), e.NewDecoder())
    result, err := ioutil.ReadAll(r)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("%s\n", result) //outputs £5 for Peppé
}

https://play.golang.org/p/Hl7r146UwhT
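
Since the goal is to store only UTF-8 in MongoDB, it may also help to check the decoded text before inserting it. The following is a minimal sketch using only the standard library; `ensureUTF8` is a hypothetical helper name, and replacing invalid bytes with U+FFFD is an assumption about what is acceptable as a last resort. It is lossy and not a substitute for decoding with the correct charset.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// ensureUTF8 is a hypothetical helper. It returns s unchanged when s is
// already valid UTF-8. Otherwise it replaces each run of invalid bytes
// with U+FFFD and reports false so the caller can log or reject the value.
func ensureUTF8(s string) (string, bool) {
	if utf8.ValidString(s) {
		return s, true
	}
	return strings.ToValidUTF8(s, "\uFFFD"), false
}

func main() {
	inputs := []string{
		"£5 for Peppé",        // already UTF-8
		"\xa35 for Pepp\xe9", // raw Latin-1 bytes, not valid UTF-8
	}
	for _, in := range inputs {
		out, ok := ensureUTF8(in)
		fmt.Printf("%q valid=%v\n", out, ok)
	}
}
```

A value that comes back with `false` usually means the charset was wrong or missing, so it is probably worth retrying the conversion with a different charset before falling back to the replaced text.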

Upvotes: 7

Not_a_Golfer

Reputation: 49157

I'm using the go-charset project to do this: https://code.google.com/p/go-charset/

It's pretty straightforward: you create a reader for a given charset and it translates to UTF-8 automatically as you read. Example from the library:

r, err := charset.NewReader("latin1", strings.NewReader("\xa35 for Pepp\xe9"))
if err != nil {
    log.Fatal(err)
}
result, err := ioutil.ReadAll(r)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("%s\n", result)  //outputs £5 for Peppé
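
To show roughly what such a reader does in the latin1 case, here is a standard-library-only sketch. It assumes strict ISO-8859-1, where each byte value is the Unicode code point of the same number; `decodeLatin1` is a hypothetical name and not part of go-charset. Other charsets need real mapping tables, which is what the library is for.

```go
package main

import "fmt"

// decodeLatin1 converts ISO-8859-1 bytes to a UTF-8 string.
// Assumption: strict ISO-8859-1, so byte 0xNN is code point U+00NN.
func decodeLatin1(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	// Converting a []rune to a string encodes each code point as UTF-8.
	return string(runes)
}

func main() {
	fmt.Println(decodeLatin1([]byte("\xa35 for Pepp\xe9"))) // prints £5 for Peppé
}
```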

Now, in my case I know the charset because it comes from web pages and I read the headers/meta tags. If you need to detect the charset automatically by heuristics, you'll need another library for that, such as this one: https://github.com/saintfish/chardet

I haven't used it but it also looks pretty simple to use:

detector := chardet.NewTextDetector()
result, err := detector.DetectBest(some_text)
if err == nil {
    fmt.Printf(
        "Detected charset is %s, language is %s",
        result.Charset,
        result.Language)
}
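
For the case where the charset is declared rather than guessed, as with the headers mentioned above, the standard library can pull it out of a Content-Type value. This is only a sketch: `charsetFromContentType` and its fallback argument are hypothetical, and it assumes you already have the header value of the message part as a string.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContentType is a hypothetical helper that extracts the charset
// parameter from a Content-Type header value. It returns def when the header
// cannot be parsed or carries no charset parameter.
func charsetFromContentType(header, def string) string {
	_, params, err := mime.ParseMediaType(header)
	if err != nil {
		return def
	}
	if cs := params["charset"]; cs != "" {
		// Charset names are case-insensitive, so normalise for lookups.
		return strings.ToLower(cs)
	}
	return def
}

func main() {
	fmt.Println(charsetFromContentType(`text/plain; charset="ISO-8859-1"`, "us-ascii"))
	fmt.Println(charsetFromContentType("text/html", "us-ascii"))
}
```

The returned name can then be passed to whichever conversion library you use, with heuristic detection kept as a fallback for parts that declare nothing.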

Upvotes: 12

Liu Zhihui

Reputation: 251

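charset.NewReader in package golang.org/x/net/html/charset expects a Content-Type value (for example "text/html; charset=gb2312") as its second argument, so it can't handle a bare encoding label such as gb2312. charset.NewReaderLabel takes the label directly and can handle it.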

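import (
    "bytes"
    "io/ioutil"

    "golang.org/x/net/html/charset"
)

// convrtToUTF8 decodes str from origEncoding (a label such as "gb2312") to UTF-8.
func convrtToUTF8(str string, origEncoding string) (string, error) {
    // NewReaderLabel returns an error for labels it does not recognise.
    reader, err := charset.NewReaderLabel(origEncoding, bytes.NewReader([]byte(str)))
    if err != nil {
        return "", err
    }
    decoded, err := ioutil.ReadAll(reader)
    if err != nil {
        return "", err
    }
    return string(decoded), nil
}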

Upvotes: 5

JakubKubera

Reputation: 436

I've found a better package, which uses iconv. Usage is trivial and is described in the documentation. For example:

output, _ := iconv.ConvertString("Hello World!", "windows-1252", "utf-8")

Link to the package: https://github.com/djimenez/iconv-go

Upvotes: 1
