Ralph

Reputation: 32294

Read lines from a file with variable line endings in Go

How can I read lines from a file where the line endings are carriage return (CR), line feed (LF), or both (CRLF)?

The PDF specification allows lines to end with CR, LF, or CRLF.

Do I need to write my own function that uses bufio.Reader.ReadByte()?

Upvotes: 8

Views: 1681

Answers (3)

2manyvcos

Reputation: 55

As mentioned by Bill S, the accepted answer may produce unintended empty lines if a CRLF is split across two calls to the split function.

If you want the Scanner to process each line as early as possible, waiting for more data in this case may not be desirable. Instead, the following solution returns the line immediately and drops a newline character that directly follows, if there is one, on the next call.

type lineSplitter struct {
    afterCR bool
}

func (s *lineSplitter) Split(data []byte, atEOF bool) (advance int, token []byte, err error) {
    if atEOF && len(data) == 0 {
        return 0, nil, nil
    }
    if s.afterCR {
        s.afterCR = false
        if data[0] == '\n' {
            // We had a carriage return before, so this newline needs to be skipped.
            return 1, nil, nil
        }
    }
    if i := bytes.IndexAny(data, "\r\n"); i >= 0 {
        if data[i] == '\n' {
            // We have a full line terminated by a single newline.
            return i + 1, data[0:i], nil
        }
        // We have a full line terminated by either a single carriage return or carriage return and newline.
        advance = i + 1
        if len(data) == i+1 {
            // The buffer ends with this carriage return, so we do not know yet whether a newline belonging to it follows.
            s.afterCR = true
        } else if data[i+1] == '\n' {
            advance += 1
        }
        return advance, data[0:i], nil
    }
    // If we're at EOF, we have a final, non-terminated line. Return it.
    if atEOF {
        return len(data), data, nil
    }
    // Request more data.
    return 0, nil, nil
}

Usage:

scan := bufio.NewScanner(r)
splitter := &lineSplitter{}
scan.Split(splitter.Split)

Upvotes: 1

Bill S

Reputation: 36

While reading an older Mac-generated file with only CR line endings, I ran into a regression in one edge case: if a CRLF is split across the buffer boundary, the accepted answer treats the CR and the LF as separate line terminators. You need to return early and request more data if the buffer ends with CR and you are not yet at EOF. This seems to solve it.

func scanLines(data []byte, atEOF bool) (advance int, token []byte, err error) {
    if atEOF && len(data) == 0 {
        return 0, nil, nil
    }
    if i := bytes.IndexAny(data, "\r\n"); i >= 0 {
        if data[i] == '\n' {
            // We have a line terminated by single newline.
            return i + 1, data[0:i], nil
        }
        // We have a line terminated by carriage return at the end of the buffer.
        if !atEOF && len(data) == i+1 {
            return 0, nil, nil
        }
        advance = i + 1
        if len(data) > i+1 && data[i+1] == '\n' {
            advance += 1
        }
        return advance, data[0:i], nil
    }
    // If we're at EOF, we have a final, non-terminated line. Return it.
    if atEOF {
        return len(data), data, nil
    }
    // Request more data.
    return 0, nil, nil
}

Upvotes: 1

kopiczko

Reputation: 3058

You can write a custom bufio.SplitFunc for bufio.Scanner, e.g.:

// Mostly bufio.ScanLines code:
func ScanPDFLines(data []byte, atEOF bool) (advance int, token []byte, err error) {
    if atEOF && len(data) == 0 {
        return 0, nil, nil
    }
    if i := bytes.IndexAny(data, "\r\n"); i >= 0 {
        if data[i] == '\n' {
            // We have a line terminated by single newline.
            return i + 1, data[0:i], nil
        }
        advance = i + 1
        if len(data) > i+1 && data[i+1] == '\n' {
            advance += 1
        }
        return advance, data[0:i], nil
    }
    // If we're at EOF, we have a final, non-terminated line. Return it.
    if atEOF {
        return len(data), data, nil
    }
    // Request more data.
    return 0, nil, nil
}

And use it like:

scan := bufio.NewScanner(r)
scan.Split(ScanPDFLines)

Upvotes: 6
