Devin Young

Reputation: 841

Go XML parsing ends early with an 'invalid UTF-8' error

I am having an issue unmarshaling XML with unicode characters.

When the XML contains only standard English characters, the entire file parses and unmarshals correctly. However, if the XML file contains a character such as ñ, á, or – (en dash), parsing stops and only the items in the array before the item with that character are returned.

For example, here is the XML:

<items>
  <item>
    <ID value="1" name="Item 1" GCName="Item 1" />
  </item>
  <item>
    <ID value="2" name="Item 2" GCName="Item 2" />
  </item>
  <item>
    <ID value="3" name="Item 3" GCName="Item 3 With ñ" />
  </item>
  <item>
    <ID value="4" name="Item 4" GCName="Item 4" />
  </item>
</items>

This is my Go code (rough):

# main.go

package main

import (
    "encoding/xml"
    "fmt"
    "io"
    "os"
)

type Response struct {
    Items []Items `xml:"items"`
}

type Items struct {
    Item []Item `xml:"item"`
}

type Item struct {
    ID    ItemID `xml:"ID"`
}

type ItemID struct {
    Value  string `xml:"value,attr"`
    Name   string `xml:"name,attr"`
    GCName string `xml:"GCName,attr"`
}

func main() {
    // Raw string literal, so the backslashes in the Windows path are not treated as escapes.
    xmlFile, err := os.Open(`C:\path\to\xml\file.xml`)
    if err != nil {
        fmt.Println("Error opening file!")
        fmt.Println(err.Error())
    }
    defer xmlFile.Close()

    xmlData, err := io.ReadAll(xmlFile)
    if err != nil {
        fmt.Println("Error reading file!")
        fmt.Println(err.Error())
    }

    var response Response
    err = xml.Unmarshal(xmlData, &response)
    if err != nil {
        fmt.Println("Error unmarshaling XML")
        fmt.Println(err.Error())
    }
    fmt.Println(response)
}

This code will print out only the first two items, as if they were the only two. It will also output:

Error unmarshaling XML
XML syntax error on line 9: invalid UTF-8

I have also tried using xml.Decoder with a CharsetReader using a different encoding, but this did not yield any different results. FWIW, I am using Windows.

Is there a way I can get around this error? Swap out the "bad" characters for something else? It was my understanding that those characters are valid UTF-8...so what gives??

Thanks in advance!

Upvotes: 4

Views: 6315

Answers (2)

Musab Gultekin

Reputation: 363

You can check whether a []byte is valid UTF-8 using utf8.Valid:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    valid := []byte("Hello, 世界")
    invalid := []byte{0xff, 0xfe, 0xfd}

    fmt.Println(utf8.Valid(valid))
    fmt.Println(utf8.Valid(invalid))
}

Or, if you just want to remove the invalid bytes:

package main

import (
    "bytes"
    "strings"
    "unicode/utf8"
)

var removeNonUTF = func(r rune) rune {
    if r == utf8.RuneError {
        return -1
    }
    return r
}

// RemoveNonUTF8Strings returns s with every invalid UTF-8 sequence removed.
func RemoveNonUTF8Strings(s string) string {
    return strings.Map(removeNonUTF, s)
}

// RemoveNonUTF8Bytes returns a copy of data with every invalid UTF-8 sequence removed.
func RemoveNonUTF8Bytes(data []byte) []byte {
    return bytes.Map(removeNonUTF, data)
}

Upvotes: 2

Marsel Novy

Reputation: 1817

Reader that filters out invalid UTF-8 characters

package main

import (
    "bufio"
    "io"
    "unicode"
    "unicode/utf8"
)

// ValidUTF8Reader implements a Reader that passes through only bytes that constitute valid UTF-8.
type ValidUTF8Reader struct {
    buffer *bufio.Reader
}

// Read fills b with valid UTF-8 from the underlying reader, skipping invalid bytes.
// n is the number of bytes written to b.
func (rd ValidUTF8Reader) Read(b []byte) (n int, err error) {
    for {
        var r rune
        var size int
        r, size, err = rd.buffer.ReadRune()
        if err != nil {
            return
        }
        if r == unicode.ReplacementChar && size == 1 {
            // ReadRune reports an invalid byte as (U+FFFD, 1); drop it.
            continue
        } else if n+size <= len(b) {
            utf8.EncodeRune(b[n:], r)
            n += size
        } else {
            // The rune does not fit in b; push it back for the next call.
            rd.buffer.UnreadRune()
            break
        }
    }
    return
}

// NewValidUTF8Reader constructs a new ValidUTF8Reader that wraps an existing io.Reader.
func NewValidUTF8Reader(rd io.Reader) ValidUTF8Reader {
    return ValidUTF8Reader{bufio.NewReader(rd)}
}

taken from here

Upvotes: 6
