Reputation: 436
I'm downloading messages via IMAP and then adding each parsed message to MongoDB. The problem is that MongoDB supports only UTF-8, and the messages arrive in various encodings. How can I convert a string from any encoding to UTF-8?
I know I could store the data as binary, but I need normal text because I have to search for phrases in the database. Or is it possible to search for normal text in binary data?
Upvotes: 11
Views: 26962
Reputation: 4274
In 2020 I found that https://pkg.go.dev/mod/golang.org/x/text worked well for me.
package main

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"log"

	"golang.org/x/text/encoding/ianaindex"
	"golang.org/x/text/transform"
)

func main() {
	text := "\xa35 for Pepp\xe9"
	charset := "latin1"
	e, err := ianaindex.MIME.Encoding(charset)
	if err != nil {
		log.Fatal(err)
	}
	r := transform.NewReader(bytes.NewBufferString(text), e.NewDecoder())
	result, err := ioutil.ReadAll(r)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s\n", result) // outputs £5 for Peppé
}
https://play.golang.org/p/Hl7r146UwhT
Upvotes: 7
Reputation: 49157
I'm using the go-charset project to do this: https://code.google.com/p/go-charset/
It's pretty straightforward: you create a reader for a given charset and it translates to UTF-8 automatically. Here is an example from the library:
r, err := charset.NewReader(strings.NewReader("\xa35 for Pepp\xe9"), "latin1")
if err != nil {
	log.Fatal(err)
}
result, err := ioutil.ReadAll(r)
if err != nil {
	log.Fatal(err)
}
fmt.Printf("%s\n", result) // outputs £5 for Peppé
Now, in my case I know the charset because it comes from web pages and I read the headers/meta tags. If you need to detect the charset automatically by heuristics, you'll need another library for that, such as this one: https://github.com/saintfish/chardet
I haven't used it but it also looks pretty simple to use:
detector := chardet.NewTextDetector()
result, err := detector.DetectBest(some_text)
if err == nil {
	fmt.Printf(
		"Detected charset is %s, language is %s",
		result.Charset,
		result.Language)
}
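When the charset is declared in a header, as in the web-page case described above (and a mail message's Content-Type header has the same shape), it can be read with the standard library before resorting to heuristics. The following is a minimal sketch; declaredCharset is a hypothetical helper name, and the header values are made-up examples.

```go
package main

import (
	"fmt"
	"mime"
)

// declaredCharset is a hypothetical helper for illustration only.
// It returns the charset parameter of a Content-Type value,
// or "" if the value cannot be parsed or declares no charset.
func declaredCharset(contentType string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return ""
	}
	return params["charset"]
}

func main() {
	// Example header values; real ones come from the message or HTTP response.
	fmt.Println(declaredCharset("text/plain; charset=ISO-8859-1")) // ISO-8859-1
	fmt.Printf("%q\n", declaredCharset("text/plain"))              // ""
}
```

An empty result is the case where a detector such as chardet would be the fallback.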
Upvotes: 12
Reputation: 251
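charset.NewReader in package golang.org/x/net/html/charset can't deal with the encoding gb2312, likely because its second argument is a Content-Type value rather than a bare charset label. charset.NewReaderLabel, which takes the label directly, can deal with it.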
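import (
	"bytes"
	"io/ioutil"

	"golang.org/x/net/html/charset"
)

// convrtToUTF8 decodes str from origEncoding (a charset label such as "gb2312") to UTF-8.
func convrtToUTF8(str string, origEncoding string) (string, error) {
	byteReader := bytes.NewReader([]byte(str))
	// NewReaderLabel returns an error for a label it does not recognize.
	reader, err := charset.NewReaderLabel(origEncoding, byteReader)
	if err != nil {
		return "", err
	}
	strBytes, err := ioutil.ReadAll(reader)
	if err != nil {
		return "", err
	}
	return string(strBytes), nil
}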
Upvotes: 5
Reputation: 436
I've found a better package, which uses iconv. Usage is trivial and is described in the documentation. For example:
output, _ := iconv.ConvertString("Hello World!", "windows-1252", "utf-8")
Link to the package: https://github.com/djimenez/iconv-go
Upvotes: 1