Reputation: 59
Is there a low cost way to test the first line in a file for a LF terminator instead of a CRLF?
We receive a lot of files from customers, and a few of them send files whose lines end with LF instead of CRLF. We're using SSIS to import them, so I need the row terminators to be consistent. (When I open such a file in Notepad++, I can see the lines end with LF instead of CRLF.)
If I read the first line of a file with StreamReader.ReadLine, the returned line doesn't appear to contain any terminator. I tested line.Contains(vbLf), then vbCr and vbCrLf, and all came back False.
I think I could read the entire file into memory and test for vbLf, but some of the files we receive are fairly large (25MB), and that seems like a huge waste of resources just to check the line terminator of the first row. Worst case, I can rewrite every line in every file we receive as line + System.Environment.NewLine, but again that's a waste for files that already use CRLF.
EDIT: Final code below, based on the answer from @icemanind (an SSIS script task that takes the directory as a variable)
Public Sub Main()
    'Gets the directory and a listing of the files and calls the sub
    Dim sPath As String
    sPath = Dts.Variables("User::DataSourceDir").Value.ToString
    Dim sDirectory As String = sPath
    Dim dirList As New DirectoryInfo(sDirectory)
    Dim fileList As FileInfo() = dirList.GetFiles()
    For Each fileName As FileInfo In fileList
        ReplaceBadEol(fileName)
    Next
    Dts.TaskResult = ScriptResults.Success
End Sub
'Temp filename postfix
Private Const fileNamePostFix As String = "_Temp.txt"
'Tests to see if the file has a valid end of line terminator and fixes if it doesn't
Private Sub ReplaceBadEol(currentFileInfo As FileInfo)
    Dim fullName As String = currentFileInfo.FullName
    If FirstLineEndsWithCrLf(fullName) Then Exit Sub
    Dim fileContent As String() = GetFileContent(currentFileInfo.FullName)
    Dim pureFileName As String = Path.GetFileNameWithoutExtension(fullName)
    Dim newFileName As String = Path.Combine(currentFileInfo.DirectoryName, pureFileName & fileNamePostFix)
    File.WriteAllLines(newFileName, fileContent)
    currentFileInfo.Delete()
    File.Move(newFileName, fullName)
End Sub
'Enum to provide info on the return
Private Enum Terminators
    None = 0
    CrLf = 1
    Lf = 2
    Cr = 3
End Enum
'Eol test reads file, advances to the end of the first line and evaluates the value
Private Function GetTerminator(fileName As String, length As Integer) As Terminators
    Using sr As New StreamReader(fileName)
        sr.BaseStream.Seek(length, SeekOrigin.Begin)
        Dim data As Integer = sr.Read()
        While data <> -1
            If data = 13 Then
                data = sr.Read()
                If data = 10 Then
                    Return Terminators.CrLf
                End If
                Return Terminators.Cr
            End If
            If data = 10 Then
                Return Terminators.Lf
            End If
            data = sr.Read()
        End While
    End Using
    Return Terminators.None
End Function
'Returns True if the file is empty or its first line ends with CRLF
Private Function FirstLineEndsWithCrLf(fileName As String) As Boolean
    Using reader As New System.IO.StreamReader(fileName)
        Dim line As String = reader.ReadLine()
        'ReadLine returns Nothing when the file is empty, so test before reading line.Length
        If line Is Nothing Then
            Return True
        End If
        Return GetTerminator(fileName, line.Length) = Terminators.CrLf
    End Using
End Function
'Reads all lines into String Array
Private Function GetFileContent(fileName As String) As String()
    Return File.ReadAllLines(fileName)
End Function
Upvotes: 0
Views: 3614
Reputation: 48736
The reason your lines test negative for vbCrLf, vbLf and vbCr is that ReadLine strips them. From the StreamReader.ReadLine documentation:
A line is defined as a sequence of characters followed by a line feed ("\n"),
a carriage return ("\r"), or a carriage return immediately followed by a line
feed ("\r\n"). The string that is returned does not contain the terminating
carriage return or line feed.
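To see that behavior in isolation, here is a minimal sketch that feeds a string through a StreamReader and checks whether the first line returned by ReadLine still contains a CR or LF. The module and function names (ReadLineDemo, FirstLineKeepsTerminator) are made up for illustration, and the in-memory stream merely stands in for a real file.

```vbnet
Imports System.IO
Imports System.Text

Module ReadLineDemo
    ' Reads the first line of the given text through a StreamReader and reports
    ' whether the returned string still contains a CR or LF character.
    ' (Hypothetical helper, not part of the original code.)
    Function FirstLineKeepsTerminator(text As String) As Boolean
        Using reader As New StreamReader(New MemoryStream(Encoding.UTF8.GetBytes(text)))
            Dim line As String = reader.ReadLine()
            If line Is Nothing Then Return False
            Return line.Contains(vbCr) OrElse line.Contains(vbLf)
        End Using
    End Function
End Module
```

Calling it with an LF-terminated or a CRLF-terminated first line should return False in both cases, which matches the Contains results described in the question.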
If you want all the lines, joined with a carriage return and line feed, try this:
Dim lines As String() = File.ReadAllLines("myfile.txt")
Dim data As String = lines.Aggregate(Function(i, j) i + VbCrLf + j)
This reads all the lines of your file, then uses LINQ's Aggregate to join them with a carriage return and line feed between consecutive lines. Note that it holds the whole file in memory while doing so.
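Since some of the files are large, a streaming variant may be worth considering. The following is only a sketch, assuming File.ReadLines is available (.NET Framework 4 or later) and that default UTF-8 handling suits the files; the names EolNormalizer, RewriteWithCrLf, sourcePath and targetPath are illustrative, not from the code above.

```vbnet
Imports System.IO

Module EolNormalizer
    ' Copies sourcePath to targetPath one line at a time, ending every line with CRLF.
    ' Hypothetical helper: only one line is held in memory at any moment.
    Sub RewriteWithCrLf(sourcePath As String, targetPath As String)
        Using writer As New StreamWriter(targetPath)
            ' Force CRLF regardless of the platform default.
            writer.NewLine = vbCrLf
            ' File.ReadLines enumerates lazily and accepts LF, CR or CRLF as a line break.
            For Each line As String In File.ReadLines(sourcePath)
                writer.WriteLine(line)
            Next
        End Using
    End Sub
End Module
```

One design consequence: the last line is always written with a terminator, even if the source file did not end with one.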
EDIT
If you just want to determine which line terminator appears first in the file, try this function:
Private Enum Terminators
    None = 0
    CrLf = 1
    Lf = 2
    Cr = 3
End Enum
Private Shared Function GetTerminator(fileName As String) As Terminators
    Using sr = New StreamReader(fileName)
        Dim data As Integer = sr.Read()
        While data <> -1
            If data = 13 Then
                data = sr.Read()
                If data = 10 Then
                    Return Terminators.CrLf
                End If
                Return Terminators.Cr
            End If
            If data = 10 Then
                Return Terminators.Lf
            End If
            data = sr.Read()
        End While
    End Using
    Return Terminators.None
End Function
Just call this function with a file name. It returns Terminators.Cr, Terminators.Lf or Terminators.CrLf for the first line break it finds, or Terminators.None if the file contains no line terminators.
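As a rough illustration of how the result might be used, the sketch below repeats the same scan over any TextReader (so it can be tried on a string) and wraps it in a check for files that need fixing. TerminatorDemo, FirstTerminator and NeedsCrLfFix are hypothetical names introduced here, and the enum is repeated only to keep the snippet self-contained.

```vbnet
Imports System.IO

Module TerminatorDemo
    Enum Terminators
        None = 0
        CrLf = 1
        Lf = 2
        Cr = 3
    End Enum

    ' Same scan as the function above, but over any TextReader.
    ' It stops at the first line break, so only the first line is read.
    Function FirstTerminator(reader As TextReader) As Terminators
        Dim data As Integer = reader.Read()
        While data <> -1
            If data = 13 Then
                ' A CR followed directly by LF is a CRLF pair; otherwise it is a bare CR.
                If reader.Read() = 10 Then Return Terminators.CrLf
                Return Terminators.Cr
            End If
            If data = 10 Then Return Terminators.Lf
            data = reader.Read()
        End While
        Return Terminators.None
    End Function

    ' True when the file has a line break and the first one is not CRLF.
    Function NeedsCrLfFix(fileName As String) As Boolean
        Using reader As New StreamReader(fileName)
            Dim found As Terminators = FirstTerminator(reader)
            Return found = Terminators.Lf OrElse found = Terminators.Cr
        End Using
    End Function
End Module
```

For example, FirstTerminator(New StringReader("a,b" & vbLf & "c,d")) evaluates to Terminators.Lf, and a file with no line break at all is left alone by NeedsCrLfFix.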
Upvotes: 2