Reputation: 32294
How can I read lines from a file where the line endings are carriage return (CR), newline (NL), or both?
The PDF specification allows lines to end with CR, LF, or CRLF.
bufio.Reader.ReadString() and bufio.Reader.ReadBytes() allow a single delimiter byte. bufio.Scanner.Scan() handles \n optionally preceded by \r, but not a lone \r.
From the bufio.ScanLines documentation: "The end-of-line marker is one optional carriage return followed by one mandatory newline."
Do I need to write my own function that uses bufio.Reader.ReadByte()?
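To make the problem concrete, here is a minimal sketch of what the default split function does with mixed line endings. The helper name defaultLines and the sample input are made up for illustration; only bufio.NewScanner, Scan, and Text are standard library calls.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// defaultLines (hypothetical helper) splits s with the Scanner's default
// split function, bufio.ScanLines.
func defaultLines(s string) []string {
	var lines []string
	scan := bufio.NewScanner(strings.NewReader(s))
	for scan.Scan() {
		lines = append(lines, scan.Text())
	}
	return lines
}

func main() {
	// LF and CRLF are recognized; the lone CR between "c" and "d" is not,
	// so "c\rd" comes back as a single line.
	fmt.Printf("%q\n", defaultLines("a\nb\r\nc\rd\n")) // prints ["a" "b" "c\rd"]
}
```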
Upvotes: 8
Views: 1681
Reputation: 55
As mentioned by Bill S, the accepted answer may produce unintentional newlines if a CRLF is split across two calls to the Split function.
If you want the Scanner to process individual lines as early as possible, waiting for more data in this case may not be desirable. Instead, the following solution returns the line immediately and then drops a newline character if one follows.
type lineSplitter struct {
	afterCR bool
}

func (s *lineSplitter) Split(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	if s.afterCR {
		s.afterCR = false
		if data[0] == '\n' {
			// We had a carriage return before, so this newline needs to be skipped.
			return 1, nil, nil
		}
	}
	if i := bytes.IndexAny(data, "\r\n"); i >= 0 {
		if data[i] == '\n' {
			// We have a full line terminated by a single newline.
			return i + 1, data[0:i], nil
		}
		// We have a full line terminated by either a single carriage return or carriage return and newline.
		advance = i + 1
		if len(data) == i+1 {
			// We are at the end of the input and do not know yet if the next symbol corresponds to the current carriage return or not.
			s.afterCR = true
		} else if data[i+1] == '\n' {
			advance += 1
		}
		return advance, data[0:i], nil
	}
	// If we're at EOF, we have a final, non-terminated line. Return it.
	if atEOF {
		return len(data), data, nil
	}
	// Request more data.
	return 0, nil, nil
}
Usage:
scan := bufio.NewScanner(r)
splitter := &lineSplitter{}
scan.Split(splitter.Split)
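A rough way to check the split-CRLF case is to feed the Scanner one byte per read, for example with testing/iotest.OneByteReader, so the CR and LF of a CRLF reach the split function in separate calls. The sketch below repeats the type from above so that it compiles on its own; the helper name linesOneByteAtATime and the sample input are made up for illustration.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
	"testing/iotest"
)

// lineSplitter is repeated from the answer so this file compiles on its own.
type lineSplitter struct {
	afterCR bool
}

func (s *lineSplitter) Split(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	if s.afterCR {
		s.afterCR = false
		if data[0] == '\n' {
			// The previous call ended on a CR; skip the LF that completes it.
			return 1, nil, nil
		}
	}
	if i := bytes.IndexAny(data, "\r\n"); i >= 0 {
		if data[i] == '\n' {
			return i + 1, data[0:i], nil
		}
		advance = i + 1
		if len(data) == i+1 {
			// The buffer ends on the CR; remember it for the next call.
			s.afterCR = true
		} else if data[i+1] == '\n' {
			advance++
		}
		return advance, data[0:i], nil
	}
	if atEOF {
		return len(data), data, nil
	}
	return 0, nil, nil
}

// linesOneByteAtATime (hypothetical helper) scans input through
// iotest.OneByteReader, so every Read delivers a single byte.
func linesOneByteAtATime(input string) []string {
	scan := bufio.NewScanner(iotest.OneByteReader(strings.NewReader(input)))
	splitter := &lineSplitter{}
	scan.Split(splitter.Split)
	var lines []string
	for scan.Scan() {
		lines = append(lines, scan.Text())
	}
	return lines
}

func main() {
	// CRLF, lone CR, and LF each end exactly one line.
	fmt.Printf("%q\n", linesOneByteAtATime("a\r\nb\rc\n")) // prints ["a" "b" "c"]
}
```

Because the splitter carries state between calls, use a fresh lineSplitter for each Scanner.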
Upvotes: 1
Reputation: 36
While reading an older Mac-generated file with only CR line endings, I ran into a regression in one edge case: if a CRLF is split across the buffer boundary, the accepted answer treats the CR and the LF as separate line terminators. You need to exit early and request more data if the buffer ends with a CR. This seems to solve it:
func scanLines(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	if i := bytes.IndexAny(data, "\r\n"); i >= 0 {
		if data[i] == '\n' {
			// We have a line terminated by single newline.
			return i + 1, data[0:i], nil
		}
		// We have a line terminated by carriage return at the end of the buffer.
		if !atEOF && len(data) == i+1 {
			return 0, nil, nil
		}
		advance = i + 1
		if len(data) > i+1 && data[i+1] == '\n' {
			advance += 1
		}
		return advance, data[0:i], nil
	}
	// If we're at EOF, we have a final, non-terminated line. Return it.
	if atEOF {
		return len(data), data, nil
	}
	// Request more data.
	return 0, nil, nil
}
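As a sanity check, the function can be run against a reader that hands over one byte per read, so the buffer ends on a CR before the LF arrives. This is only a sketch: it repeats scanLines so that it compiles on its own, and the helper name linesOneByteAtATime and the input are made up for illustration.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
	"testing/iotest"
)

// scanLines is repeated from the answer so this file compiles on its own.
func scanLines(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	if i := bytes.IndexAny(data, "\r\n"); i >= 0 {
		if data[i] == '\n' {
			return i + 1, data[0:i], nil
		}
		// The buffer ends on a CR and more input may follow: wait for it.
		if !atEOF && len(data) == i+1 {
			return 0, nil, nil
		}
		advance = i + 1
		if len(data) > i+1 && data[i+1] == '\n' {
			advance++
		}
		return advance, data[0:i], nil
	}
	if atEOF {
		return len(data), data, nil
	}
	return 0, nil, nil
}

// linesOneByteAtATime (hypothetical helper) scans input through
// iotest.OneByteReader, so every Read delivers a single byte.
func linesOneByteAtATime(input string) []string {
	scan := bufio.NewScanner(iotest.OneByteReader(strings.NewReader(input)))
	scan.Split(scanLines)
	var lines []string
	for scan.Scan() {
		lines = append(lines, scan.Text())
	}
	return lines
}

func main() {
	// The CRLF after "a" is seen first as "a\r", then as "a\r\n".
	fmt.Printf("%q\n", linesOneByteAtATime("a\r\nb\rc")) // prints ["a" "b" "c"]
}
```

The trade-off is that a line ending in a lone CR is only returned once the next byte (or EOF) has been read.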
Upvotes: 1
Reputation: 3058
You can write a custom bufio.SplitFunc for bufio.Scanner. E.g.:
// Mostly bufio.ScanLines code:
func ScanPDFLines(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	if i := bytes.IndexAny(data, "\r\n"); i >= 0 {
		if data[i] == '\n' {
			// We have a line terminated by single newline.
			return i + 1, data[0:i], nil
		}
		// We have a line terminated by a carriage return, optionally followed by a newline.
		// Caveat: if the buffer ends right after the CR of a CRLF, the LF arrives
		// in the next call and is returned as an extra empty line.
		advance = i + 1
		if len(data) > i+1 && data[i+1] == '\n' {
			advance += 1
		}
		return advance, data[0:i], nil
	}
	// If we're at EOF, we have a final, non-terminated line. Return it.
	if atEOF {
		return len(data), data, nil
	}
	// Request more data.
	return 0, nil, nil
}
And use it like:
scan := bufio.NewScanner(r)
scan.Split(ScanPDFLines)
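For completeness, a small self-contained program along these lines might look as follows. It repeats ScanPDFLines so that it compiles on its own; the helpers scanAll, linesWhole, and linesOneByte and the sample inputs are made up for illustration. The second call uses testing/iotest.OneByteReader to reproduce the split-CRLF edge case described in the other answers.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
	"testing/iotest"
)

// ScanPDFLines is repeated from the answer so this file compiles on its own.
func ScanPDFLines(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	if i := bytes.IndexAny(data, "\r\n"); i >= 0 {
		if data[i] == '\n' {
			return i + 1, data[0:i], nil
		}
		advance = i + 1
		if len(data) > i+1 && data[i+1] == '\n' {
			advance++
		}
		return advance, data[0:i], nil
	}
	if atEOF {
		return len(data), data, nil
	}
	return 0, nil, nil
}

// scanAll (hypothetical helper) collects every line the Scanner yields.
func scanAll(r io.Reader) []string {
	scan := bufio.NewScanner(r)
	scan.Split(ScanPDFLines)
	var lines []string
	for scan.Scan() {
		lines = append(lines, scan.Text())
	}
	return lines
}

// linesWhole reads the input in one go; linesOneByte reads it a byte at a time.
func linesWhole(input string) []string { return scanAll(strings.NewReader(input)) }

func linesOneByte(input string) []string {
	return scanAll(iotest.OneByteReader(strings.NewReader(input)))
}

func main() {
	// CR, CRLF, and LF all end a line when the data arrives in one read.
	fmt.Printf("%q\n", linesWhole("a\rb\r\nc\nd")) // prints ["a" "b" "c" "d"]
	// With the CR and LF in separate reads, an extra empty line appears.
	fmt.Printf("%q\n", linesOneByte("a\r\nb")) // prints ["a" "" "b"]
}
```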
Upvotes: 6