Vejto

Reputation: 1314

CSV parser in Go breaks due to trailing space

We are trying to parse a CSV file using Go's encoding/csv package. This particular CSV is a bit peculiar: each row has a trailing space. When decoding rows with quoted fields, the package fails because after a closing quote it expects a separator, a newline, or another quote; the trailing space is not expected.

How would you handle this case? Do you know of another parser that we could use?

Edit:

f, err := os.Open("file.csv")
// err etc..
csvr := csv.NewReader(f)
csvr.Comma = csvDelimiter
for {
    rowAsSlice, err := csvr.Read()
    // Handle row and errors etc. (break on io.EOF)
}

Edit 2: CSV example, mind the trailing space!

"RECORD_TYPE","COMPANY_SHORTNAME" 
"HDR","COMPANY_EXAMPLE" 

Upvotes: 2

Views: 1943

Answers (1)

maerics

Reputation: 156444

One possible solution is to wrap the source file reader in a custom reader whose Read(...) method silently trims trailing whitespace from whatever the underlying reader returns. The csv.Reader can then consume that wrapper directly.

For example (Go Playground):

package main

import (
  "encoding/csv"
  "fmt"
  "io"
  "os"
  "regexp"
)

// TrimReader wraps an io.Reader and strips trailing spaces before each newline.
type TrimReader struct{ io.Reader }

var trailingws = regexp.MustCompile(` +\r?\n`)

func (tr TrimReader) Read(bs []byte) (int, error) {
  // Perform the requested read on the given reader.
  n, err := tr.Reader.Read(bs)
  if err != nil {
    return n, err
  }

  // Remove trailing whitespace from each line.
  lines := string(bs[:n])
  trimmed := []byte(trailingws.ReplaceAllString(lines, "\n"))
  copy(bs, trimmed)
  return len(trimmed), nil
}

func main() {
  file, err := os.Open("myfile.csv")
  if err != nil {
    return // TODO: handle err...
  }

  csvr := csv.NewReader(TrimReader{file})

  for {
    record, err := csvr.Read()
    if err == io.EOF {
      break
    }
    fmt.Printf("LINE: record=%#v, err=%v\n", record, err)
  }
  // LINE: record=[]string{"RECORD_TYPE", "COMPANY_SHORTNAME"}, err=<nil>
  // LINE: record=[]string{"HDR", "COMPANY_EXAMPLE"}, err=<nil>
}

Note that, as commenter @svsd points out, there is a subtle bug here: trailing whitespace can still slip through when a Read call ends mid-line and the line terminator only arrives on the next call. You can work around this by buffering or, perhaps best, by simply preprocessing these CSV files to remove the trailing whitespace before attempting to parse them.
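
A minimal sketch of that preprocessing route (not from the original answer; the helper name trimmedCSVReader is illustrative, it buffers the whole input in memory, and it assumes lines fit within bufio.Scanner's default token limit): read the input line by line, trim trailing spaces and tabs, and hand the cleaned text to csv.NewReader.

package main

import (
  "bufio"
  "encoding/csv"
  "fmt"
  "io"
  "os"
  "strings"
)

// trimmedCSVReader reads all of r, trims trailing spaces and tabs from every
// line, and returns a csv.Reader over the cleaned text.
func trimmedCSVReader(r io.Reader) (*csv.Reader, error) {
  var cleaned strings.Builder
  sc := bufio.NewScanner(r)
  for sc.Scan() {
    // Scanner strips the newline (and a preceding \r); re-add a plain \n.
    cleaned.WriteString(strings.TrimRight(sc.Text(), " \t"))
    cleaned.WriteByte('\n')
  }
  if err := sc.Err(); err != nil {
    return nil, err
  }
  return csv.NewReader(strings.NewReader(cleaned.String())), nil
}

func main() {
  file, err := os.Open("myfile.csv")
  if err != nil {
    return // TODO: handle err...
  }
  defer file.Close()

  csvr, err := trimmedCSVReader(file)
  if err != nil {
    return // TODO: handle err...
  }

  for {
    record, err := csvr.Read()
    if err == io.EOF {
      break
    }
    fmt.Printf("LINE: record=%#v, err=%v\n", record, err)
  }
}

Because the trimming happens on complete lines rather than on arbitrary read chunks, the split-terminator problem described above cannot occur.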

Upvotes: 3
