Reputation: 219
I have a text file that is automatically generated by an older computer system daily.
Unfortunately, the columns in this file are not delimited and are not exactly fixed width: each day the width of each column can change depending on the number of characters in that column's data. The file does have column headings, so I want to find the width of each column from the headings. Here is an example of the column heading row:
JOB_NO[variable amount of white space chars]FILE_NAME[variable amount of ws chars]PROJECT_CODE[variable amount of ws chars][carriage return]
What I want to do is get the index of the first char in a column and the index of the last white space of a column (from the column heading). For the first column, I would want the index of the "J" in JOB_NO and the index of the last white space before the "F" in FILE_NAME.
I guess I should also mention that the columns may not always be in the same order from day to day but they will have the same header names.
Any thoughts about how to do this in VB.NET or C#? I know I can use string.IndexOf("JOB_NO") to get the index of the start of the column, but how do I get the index of the last white space in each column (that is, the last whitespace before the first non-whitespace char that marks the start of the next column)?
Upvotes: 4
Views: 4205
Reputation: 604
Here is an alternative answer using a small class which you can later use to parse your lines. You can use the Fields collection as a template to pull the fields out of each of your lines. This solution does not ignore the whitespace: I presume it varies because the fields differ in length each day, so you need it to know each column's width.
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module Module1
    Sub Main()
        Dim line As String = "JOB_NUM FILE_NAME SOME_OTHER_THING "
        Dim Fields As New List(Of Field)
        'Match the first character of each heading: a word character
        'that follows either the start of the line or a space
        Dim mc As MatchCollection = Regex.Matches(line, "(?<=^| )\w")
        'Loop through the matches
        For Each m As Match In mc
            Dim oField As New Field
            oField.Start = m.Index
            Dim nextMatch As Match = m.NextMatch()
            If Not nextMatch.Success Then
                'This is the last field
                oField.Length = line.Length - oField.Start
            Else
                oField.Length = nextMatch.Index - oField.Start
            End If
            'Take the field name and trim its padding
            oField.Name = line.Substring(oField.Start, oField.Length).Trim()
            'Add to the list
            Fields.Add(oField)
        Next
        'Check the Fields: you can use line.Substring(oField.Start, oField.Length)
        'to parse each line of your file.
        For Each f As Field In Fields
            Console.WriteLine("Field Name: " & f.Name)
            Console.WriteLine("Start: " & f.Start)
            Console.WriteLine("Length: " & f.Length)
        Next
        Console.Read()
    End Sub

    Class Field
        Public Property Name As String
        Public Property Start As Integer
        Public Property Length As Integer
    End Class
End Module
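The last step mentioned in the comments (using each field's start and length to cut up the data lines) could look roughly like this in C#. This is only a sketch: RowSlicer and Slice are made-up names, the start/length pairs are the ones the code above produces for the sample heading row, and the data row is invented for illustration. It assumes a data line may be shorter than the heading row, so the lengths are clamped.

```csharp
using System;
using System.Collections.Generic;

public static class RowSlicer
{
    // Hypothetical helper: cuts one data line into trimmed values using
    // the (start, length) pairs measured on the heading row.
    public static List<string> Slice(string row, IEnumerable<(int Start, int Length)> fields)
    {
        var values = new List<string>();
        foreach (var (start, length) in fields)
        {
            // Assumption: a data line can be shorter than the heading row,
            // so return an empty value or clamp the length instead of throwing.
            if (start >= row.Length)
            {
                values.Add(string.Empty);
                continue;
            }
            int len = Math.Min(length, row.Length - start);
            values.Add(row.Substring(start, len).Trim());
        }
        return values;
    }

    public static void Main()
    {
        // Start/length pairs for "JOB_NUM FILE_NAME SOME_OTHER_THING ".
        var fields = new[] { (Start: 0, Length: 8), (Start: 8, Length: 10), (Start: 18, Length: 17) };
        // Invented sample row, padded to the same column starts.
        string row = "1001".PadRight(8) + "a.txt".PadRight(10) + "P-7";
        Console.WriteLine(string.Join("|", Slice(row, fields))); // prints 1001|a.txt|P-7
    }
}
```

Trimming each value is a design choice here; drop the Trim() call if the padding itself is meaningful in your data.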
Upvotes: 0
Reputation: 120450
Borrowing heavily from a previous answer I've given... To get column positions, how about this? I'm making the assumption that column names do not contain spaces.
IEnumerable<int> positions = Regex
    .Matches("JOB_NUM FILE_NAME SOME_OTHER_THING", @"(?<=^| )\w")
    .Cast<Match>()
    .Select(m => m.Index);
Or, as a more verbose version of the above:
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

//first get a MatchCollection
//this regular expression matches a word character that immediately follows
//either the start of the line or a space, i.e. the first char of each of
//your column headers
MatchCollection matches = Regex
    .Matches("JOB_NUM FILE_NAME SOME_OTHER_THING", @"(?<=^| )\w");
//convert to IEnumerable<Match>, so we can use Linq on our matches
IEnumerable<Match> matchEnumerable = matches.Cast<Match>();
//For each match, select its Index
IEnumerable<int> positions = matchEnumerable.Select(m => m.Index);
//convert to array (if you want)
int[] pos_arr = positions.ToArray();
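To go from start positions to what the question asks for (the first char of each column and its last whitespace), each column can be taken to end one character before the next heading starts. A rough sketch under that assumption follows. HeaderColumns and GetSpans are hypothetical names, the heading row is built with PadRight only to make the widths explicit, and the pattern \S+ (one run of non-whitespace per heading) is swapped in for the lookbehind above so that the name is captured too. It also assumes any trailing carriage return has already been stripped from the heading row.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class HeaderColumns
{
    // Hypothetical helper: returns the name, the index of the first char,
    // and the index of the last char (trailing whitespace) of each column.
    public static List<(string Name, int Start, int End)> GetSpans(string header)
    {
        // \S+ matches each run of non-whitespace characters, i.e. one heading.
        var matches = Regex.Matches(header, @"\S+").Cast<Match>().ToList();
        var spans = new List<(string Name, int Start, int End)>();
        for (int i = 0; i < matches.Count; i++)
        {
            // A column ends one character before the next heading starts;
            // the last column runs to the end of the heading row.
            int end = i + 1 < matches.Count
                ? matches[i + 1].Index - 1
                : header.Length - 1;
            spans.Add((matches[i].Value, matches[i].Index, end));
        }
        return spans;
    }

    public static void Main()
    {
        // Illustrative heading row with column widths 10, 15 and 14.
        string header = "JOB_NO".PadRight(10) + "FILE_NAME".PadRight(15) + "PROJECT_CODE".PadRight(14);
        foreach (var (name, start, end) in GetSpans(header))
            Console.WriteLine($"{name}: start={start}, end={end}");
    }
}
```

Because the spans are keyed by heading name, this keeps working when the column order changes from day to day.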
Upvotes: 0
Reputation: 18797
Get the index of each column heading, e.g.:
var jPos = str.IndexOf("JOB_NO");
var filePos = str.IndexOf("FILE_NAME");
var projPos = str.IndexOf("PROJECT_CODE");
Then sort them into an array (ar below), from min to max. Now you know the order of your columns, and the last space of each column is at the next column's index minus 1:
int firstColLastSpace = ar[1] -1;
int secColLastSpace = ar[2] -1;
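Put together, the IndexOf-and-sort idea might look like the sketch below. HeaderIndexOf and Locate are made-up names and the heading row is invented, with the columns deliberately in a different order. One caveat to keep in mind: IndexOf finds the first occurrence of the text, so this assumes no heading name is contained inside another (for example FILE and FILE_NAME).

```csharp
using System;
using System.Linq;

public static class HeaderIndexOf
{
    // Hypothetical helper: finds each known heading with IndexOf, sorts the
    // headings by position, and derives the last space of each column from
    // the start of the column that follows it.
    public static (string Name, int Start, int LastSpace)[] Locate(string header, params string[] names)
    {
        var sorted = names
            .Select(n => (Name: n, Start: header.IndexOf(n, StringComparison.Ordinal)))
            .Where(c => c.Start >= 0)   // skip headings missing from this file
            .OrderBy(c => c.Start)
            .ToArray();

        return sorted
            .Select((c, i) => (
                Name: c.Name,
                Start: c.Start,
                // Last column: assume it runs to the end of the heading row.
                LastSpace: i + 1 < sorted.Length ? sorted[i + 1].Start - 1 : header.Length - 1))
            .ToArray();
    }

    public static void Main()
    {
        // Illustrative heading row with the columns in a different order.
        string header = "PROJECT_CODE".PadRight(14) + "JOB_NO".PadRight(8) + "FILE_NAME".PadRight(12);
        foreach (var col in Locate(header, "JOB_NO", "FILE_NAME", "PROJECT_CODE"))
            Console.WriteLine($"{col.Name}: start={col.Start}, lastSpace={col.LastSpace}");
    }
}
```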
Upvotes: 3