Reputation: 29
all So, I'm trying to figure out how to make a simple regex code for Visual Basic.net, but am not getting anywhere.
I'm parsing csv files into a list of array, but the source csv's are anything but pristine. There are extra/rogue quotes in just enough places to crash the program, and enough sets of quotes to make fixing the data manually cumbersome.
I've written in a bunch of error-checking, and it works about 99.99% of the time. However, with 10,000 lines to parse for each folder, that averages one error per set of csv files. Crash. To get that last 0.01% parsed properly, I've created an If statement that will pull out lines that have odd numbers of quotes and remove ALL of them, which triggers a manual error-check If there are zero quotes, the field processes as usual. If there's an even number of quotes, the standard Split function cannot ignore delimiters between quotes without a regex.
Could someone help me figure out a regex string that will ignore fields enclosed in quotes?
Here's the code I've been able to think up up to this point.
Thank you in advance
Using filereader1 As New Microsoft.VisualBasic.FileIO.TextFieldParser(files_(i),
System.Text.Encoding.Default) 'system text decoding adds odd characters
filereader1.TextFieldType = FieldType.Delimited
'filereader1.Delimiters = New String() {","}
filereader1.SetDelimiters(",")
filereader1.HasFieldsEnclosedInQuotes = True
For Each c As Char In whole_string
If c = """" Then cnt = cnt + 1
Next
If cnt = 0 Then 'no quotes
split_string = Split(whole_string, ",") 'split by commas
ElseIf cnt Mod 2 = 0 Then 'even number of quotes
split_string = Regex.Split(whole_string, "(?=(([^""]|.)*""([^""]|.)*"")*([^""]|.)*$)")
ElseIf cnt <> 0 Then 'odd number of quotes
whole_string = whole_string.Replace("""", " ") 'delete all quotes
split_string = Split(whole_string, ",") 'split by commas
End If
Upvotes: 0
Views: 1913
Reputation: 41838
In VB.NET, there are several ways to proceed.
Option 1
You can use this regex: ,(?![^",]*")
It matches commas that are not inside quotes: a comma ,
that is not followed (as asserted by the negative lookahead (?![^",]*")
) by characters that are neither a comma nor a quote then a quote.
In VB.NET, something like:
Dim MyRegex As New Regex(",(?![^"",]*"")")
ResultString = MyRegex.Replace(Subject, "|")
Option 2
This uses this beautifully simple regex: "[^"]*"|(,)
This is a more general solution and easy to tweak solution. For a full description, I recommend you have a look at this question about of Regex-matching or replacing... except when.... It can make a very tidy solution that is easy to maintain if you find other cases to tweak.
The left side of the alternation |
matches complete "quotes"
. We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right ones because they were not matched by the expression on the left.
This code should work:
Imports System
Imports System.Text.RegularExpressions
Imports System.Collections.Specialized
Module Module1
Sub Main()
Dim MyRegex As New Regex("""[^""]*""|(,)")
Dim Subject As String = "LIST,410210,2-4,""PUMP, HYDRAULIC PISTON - MAIN"",1,,,"
Dim Replaced As String = myRegex.Replace(Subject,
Function(m As Match)
If (m.Groups(1).Value = "") Then
Return ""
Else
Return m.Groups(0).Value
End If
End Function)
Console.WriteLine(Replaced)
Console.WriteLine(vbCrLf & "Press Any Key to Exit.")
Console.ReadKey()
End Sub
End Module
Reference
Upvotes: 2