Dave

Reputation: 59

Replace Lf with CrLf in files

Is there a low-cost way to test whether the first line of a file ends with LF instead of CRLF?

We receive a lot of files from customers, and a few of them send files whose lines end with LF instead of CRLF. We're using SSIS to import them, so I need the row terminators to be consistent. (When I open such a file in Notepad++, I can see that the lines end with LF instead of CRLF.)

If I read the first line of a file with StreamReader.ReadLine, the returned line doesn't appear to contain any terminator. I tested line.Contains(vbLf), line.Contains(vbCr) and line.Contains(vbCrLf), and all returned False.

I think I could read the entire file into memory and test for vbLf, but some of the files we receive are fairly large (25 MB), and that seems like a huge waste of resources just to check the line terminator of the first row. Worst case, I could rewrite every line in every file we receive as line + System.Environment.NewLine, but again, that's a waste for files that already use CRLF.

EDIT: Final code below, based on the answer from @icemanind (SSIS script task, with the directory passed in as a variable)

Public Sub Main()
'Gets the directory and a listing of the files and calls the sub

    Dim sPath As String
    sPath = Dts.Variables("User::DataSourceDir").Value.ToString
    Dim sDirectory As String = sPath
    Dim dirList As New DirectoryInfo(sDirectory)
    Dim fileList As FileInfo() = dirList.GetFiles()

    For Each fileName As FileInfo In fileList
        ReplaceBadEol(fileName)
    Next

    Dts.TaskResult = ScriptResults.Success
End Sub

'Temp filename postfix
Private Const fileNamePostFix As String = "_Temp.txt"

'Tests to see if the file has a valid end of line terminator and fixes if it doesn't
Private Sub ReplaceBadEol(currentFileInfo As FileInfo)
    Dim fullName As String = currentFileInfo.FullName
    If FirstLineEndsWithCrLf(fullName) Then Exit Sub
    Dim fileContent As String() = GetFileContent(currentFileInfo.FullName)
    Dim pureFileName As String = Path.GetFileNameWithoutExtension(fullName)
    Dim newFileName As String = Path.Combine(currentFileInfo.DirectoryName, pureFileName & fileNamePostFix)
    File.WriteAllLines(newFileName, fileContent)
    currentFileInfo.Delete()
    File.Move(newFileName, fullName)
End Sub

'Enum to provide info on the return
Private Enum Terminators
    None = 0
    CrLf = 1
    Lf = 2
    Cr = 3
End Enum

'Eol test reads file, advances to the end of the first line and evaluates the value
Private Function GetTerminator(fileName As String, length As Integer) As Terminators
    Using sr As New StreamReader(fileName)
        sr.BaseStream.Seek(length, SeekOrigin.Begin)
        Dim data As Integer = sr.Read()

        While data <> -1
            If data = 13 Then
                data = sr.Read()
                If data = 10 Then
                    Return Terminators.CrLf
                End If
                Return Terminators.Cr
            End If
            If data = 10 Then
                Return Terminators.Lf
            End If
            data = sr.Read()
        End While
    End Using

    Return Terminators.None
End Function

'Returns True when the file is empty (or its first line is blank) or when the first line ends with CrLf
Private Function FirstLineEndsWithCrLf(fileName As String) As Boolean

    Using reader As New System.IO.StreamReader(fileName)
        Dim line As String = reader.ReadLine()

        'ReadLine returns Nothing for an empty file, so test before touching line.Length
        If String.IsNullOrWhiteSpace(line) Then
            Return True
        End If

        'Compare against the enum member rather than the literal 1
        Return GetTerminator(fileName, line.Length) = Terminators.CrLf
    End Using

End Function

'Reads all lines into String Array
Private Function GetFileContent(fileName As String) As String()
    Return File.ReadAllLines(fileName)
End Function

Upvotes: 0

Views: 3614

Answers (1)

Icemanind

Reputation: 48736

Your lines test negative for vbCrLf, vbLf and vbCr because ReadLine strips the terminator. From the StreamReader.ReadLine documentation:

A line is defined as a sequence of characters followed by a line feed ("\n"), 
a carriage return ("\r"), or a carriage return immediately followed by a line 
feed ("\r\n"). The string that is returned does not contain the terminating 
carriage return or line feed.
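
The behaviour described in that quote can be checked with a small console sketch. It uses StringReader, which follows the same ReadLine rules, so no file is needed; FirstLine is a hypothetical helper name used only for illustration.

```vbnet
Imports System
Imports System.IO
Imports Microsoft.VisualBasic

Module ReadLineDemo
    ' Hypothetical helper: returns the first line exactly as ReadLine reports it.
    Function FirstLine(text As String) As String
        Using reader As New StringReader(text)
            Return reader.ReadLine()
        End Using
    End Function

    Sub Main()
        ' The same text, once with an Lf terminator and once with CrLf.
        For Each sample As String In {"abc" & vbLf & "def", "abc" & vbCrLf & "def"}
            Dim line As String = FirstLine(sample)
            ' The terminator has already been consumed, so both Contains checks are False
            ' and the length is 3 in both cases.
            Console.WriteLine("{0} {1} {2}", line.Length, line.Contains(vbLf), line.Contains(vbCr))
        Next
    End Sub
End Module
```

Because the returned string is identical for both inputs, the terminator has to be detected by reading raw characters rather than by inspecting the result of ReadLine.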

If you want all the lines, joined with a carriage return and line feed, try this:

Dim lines As String() = File.ReadAllLines("myfile.txt")
Dim data As String = lines.Aggregate(Function(i, j) i + VbCrLf + j)

This reads all the lines of your file into memory, then uses LINQ's Aggregate to join them with a carriage return and line feed (CrLf).
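
Since some of the files are large, the same normalisation can also be done one line at a time rather than by loading the whole file. The following is only a sketch under some assumptions: ConvertToCrLf and its parameter names are hypothetical, the target path must differ from the source path, and the default StreamReader/StreamWriter encodings (UTF-8) are assumed to suit the data.

```vbnet
Imports System.IO
Imports Microsoft.VisualBasic

Module EolConverter
    ' Hypothetical helper: copies sourcePath to targetPath, ending every line with CrLf.
    Sub ConvertToCrLf(sourcePath As String, targetPath As String)
        Using reader As New StreamReader(sourcePath)
            Using writer As New StreamWriter(targetPath)
                ' WriteLine appends writer.NewLine, so set it explicitly
                ' instead of relying on the platform default.
                writer.NewLine = vbCrLf

                Dim line As String = reader.ReadLine()
                While line IsNot Nothing
                    writer.WriteLine(line)
                    line = reader.ReadLine()
                End While
            End Using
        End Using
    End Sub
End Module
```

Only one line is held in memory at a time. Note that the last line always receives a CrLf, even if the source file did not end with a terminator.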

EDIT

If you just want to determine which line terminator appears first in the file, try this function:

Private Enum Terminators
    None = 0
    CrLf = 1
    Lf = 2
    Cr = 3
End Enum

Private Shared Function GetTerminator(fileName As String) As Terminators
    Using sr = New StreamReader(fileName)
        Dim data As Integer = sr.Read()

        While data <> -1
            If data = 13 Then
                data = sr.Read()
                If data = 10 Then
                    Return Terminators.CrLf
                End If
                Return Terminators.Cr
            End If
            If data = 10 Then
                Return Terminators.Lf
            End If
            data = sr.Read()
        End While
    End Using

    Return Terminators.None
End Function

Call this function with a file name and it returns Terminators.Cr, Terminators.Lf or Terminators.CrLf, or Terminators.None if the file contains no line terminators.
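
As a rough illustration of how the result might drive the decision to rewrite a file, here is a self-contained variant. It is a sketch, not the code above: this GetTerminator takes a TextReader (so it can be tried on a string) and uses Peek after a carriage return, and NeedsRewrite is a hypothetical wrapper name.

```vbnet
Imports System
Imports System.IO

Module TerminatorCheck
    Enum Terminators
        None = 0
        CrLf = 1
        Lf = 2
        Cr = 3
    End Enum

    ' Same scan as the function above, but over any TextReader.
    Function GetTerminator(reader As TextReader) As Terminators
        Dim data As Integer = reader.Read()

        While data <> -1
            If data = 13 Then
                ' A carriage return counts as CrLf only when a line feed follows directly.
                Return If(reader.Peek() = 10, Terminators.CrLf, Terminators.Cr)
            End If
            If data = 10 Then Return Terminators.Lf
            data = reader.Read()
        End While

        Return Terminators.None
    End Function

    ' Hypothetical wrapper: True only when the first terminator is a bare Lf or Cr.
    Function NeedsRewrite(fileName As String) As Boolean
        Using reader As New StreamReader(fileName)
            Dim found As Terminators = GetTerminator(reader)
            Return found = Terminators.Lf OrElse found = Terminators.Cr
        End Using
    End Function
End Module
```

The scan stops at the first terminator it meets, so a file that already uses CrLf is not read in full. A file with no terminator at all is read to the end and reported as None, which NeedsRewrite treats as nothing to fix.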

Upvotes: 2
