Reputation: 911
I have a VBA application that runs every day. It checks a folder where CSVs are downloaded automatically and adds their contents to a database. While parsing them, I realized that certain values contain commas as part of their content, and those values are wrapped in double quotes (string literals).
So I'm trying to figure out how to parse this CSV and ignore commas that are contained in string literals. For example:
1,2,3,"This should,be one part",5,6,7
should return:
1
2
3
"This should,be one part"
5
6
7
I have been using VBA's Split() function because I don't want to reinvent the wheel, but if I have to, I guess I'll do something else.
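For example, splitting the sample line directly on the comma chops the quoted value in two:
Dim parts() As String
parts = Split("1,2,3,""This should,be one part"",5,6,7", ",")
' Returns 9 pieces instead of 7: the quoted value ends up split across parts(3) and parts(4)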
Any suggestions would be appreciated.
Upvotes: 18
Views: 42898
Reputation: 583
Regex is slow, and this is a problem of limited variability.
The context-dependent behaviour involves only commas and quotes.
So brute-force logic is easy to write, fast to run and easy to understand. In my experience this kind of hard-coded loop is noticeably faster than a regex-based approach (roughly 5-10x, though I haven't timed it), which matters for batch jobs.
' A fast, hard-coded method for splitting a CSV string which contains quoted sections
' e.g. 1,2,"Comma,Separated,Values",Comma,Separated,Values is split into
'      1 | 2 | Comma,Separated,Values | Comma | Separated | Values (the quote characters themselves are stripped)
Public Function TokenizeCsvFast(sourceLine As String)

    Dim tokens() As String
    ReDim tokens(1 To 1)

    Dim newToken As String
    Dim newTokenNumber As Long
    newTokenNumber = 0

    Dim inQuotes As Boolean
    Dim stringPosition As Long

    For stringPosition = 1 To Len(sourceLine)

        Dim newCharacter As String
        newCharacter = Mid$(sourceLine, stringPosition, 1)

        Dim newTokenComplete As Boolean
        newTokenComplete = False

        If newCharacter = """" Then ' Handle quotes as an explicit case
            inQuotes = Not inQuotes
        ElseIf newCharacter = "," Then
            If inQuotes Then
                ' If in quotes, just build up the new token
                newToken = newToken & newCharacter
            Else
                ' Outside of quotes, a comma separates values
                newTokenComplete = True
            End If
        ElseIf stringPosition = Len(sourceLine) Then
            ' The terminal token may not have a terminal comma
            newToken = newToken & newCharacter
            newTokenComplete = True
        Else
            ' Build up the new token one character at a time
            newToken = newToken & newCharacter
        End If

        If newTokenComplete Then
            ' Add the completed new token to the return array
            newTokenNumber = newTokenNumber + 1
            If newTokenNumber > UBound(tokens) Then
                ReDim Preserve tokens(1 To newTokenNumber)
            End If
            tokens(newTokenNumber) = newToken
            ' Debug.Print newToken
            ' Start new token afresh
            newToken = ""
        End If

    Next

    ' If the line ends with a closing quote, the final token was never flushed inside the loop
    If Len(newToken) > 0 Then
        newTokenNumber = newTokenNumber + 1
        If newTokenNumber > UBound(tokens) Then
            ReDim Preserve tokens(1 To newTokenNumber)
        End If
        tokens(newTokenNumber) = newToken
    End If

    TokenizeCsvFast = tokens

End Function
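For example, a quick usage sketch against the OP's sample line (not part of the original answer):
Dim fields() As String
fields = TokenizeCsvFast("1,2,3,""This should,be one part"",5,6,7")
Debug.Print fields(4)   ' prints: This should,be one part (array is 1-based, quotes stripped)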
Upvotes: 0
Reputation: 935
The easiest solution might be to download a CSV parser written in VBA from GitHub. There are at least three available, and I'm the author of this one:
https://github.com/PGS62/VBA-CSV
Then the answer to the OP's question is to call the function CSVRead, passing in the example string given in the question:
CSVRead("1,2,3,""This should,be one part"",5,6,7")
which returns a 1x7 array.
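For example, a hedged sketch of consuming that result, assuming CSVRead returns a two-dimensional Variant array as the 1x7 description suggests:
Dim v As Variant
Dim j As Long
v = CSVRead("1,2,3,""This should,be one part"",5,6,7")
For j = LBound(v, 2) To UBound(v, 2)
    Debug.Print v(LBound(v, 1), j)
Next j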
Upvotes: 0
Reputation: 452
I find that solutions based on split() and join() tend to be very fast compared to looping through characters. Getting a regular expression to work is also challenging if there can be multiple commas or multiple line breaks within a quoted string. I was just working with such a file, found here.
The function below uses the same basic mechanism as the top answer, but deals with the whole file rather than just a single line. Declarations are omitted for brevity.
Function CSVToArray(sourceText, rowDelim, columnDelim, Optional stringNotInSourceText = "|/", Optional removeErrorRows = False)
'Converts CSV text to a two-dimensional array. It is fast because it uses Split() and Join().
'To de-activate any combination of delimiter characters in quoted strings, they are first converted using the stringNotInSourceText argument
'The delimiter characters in the quoted strings are returned to their original values at the end
'Validate stringNotInSourceText and abort if it occurs in the source, since the substitution trick would then corrupt the data
If InStr(1, sourceText, stringNotInSourceText) > 0 Then
Debug.Print "Error: The provided stringNotInSourceText appears in the sourceText"
Exit Function
End If
'Make replacement delimiters
rowDelimReplacement = stringNotInSourceText & "R"
columnDelimReplacement = stringNotInSourceText & "C"
'Now, we need to separate quoted strings out so we can replace the delimiters inside them
splitQuotes = Split(sourceText, """")
'Conveniently, if we loop through the array with Step 2, starting at 1, we get all the quoted strings
For i = 1 To UBound(splitQuotes) Step 2
splitQuotes(i) = Replace(splitQuotes(i), rowDelim, rowDelimReplacement)
splitQuotes(i) = Replace(splitQuotes(i), columnDelim, columnDelimReplacement)
Next
'Rejoin to a now disambiguated text (a rowDelim or columnDelim character is now always an actual delimiter)
disambiguatedText = Join(splitQuotes, """")
'Now we can split the disambiguated text to rows, without interference from characters in quotes
rowArray = Split(disambiguatedText, rowDelim)
'Use a sample row to count the number of columns
rowSample = Split(rowArray(0), columnDelim)
rowSampleUBound = UBound(rowSample)
'Populate the two-dimensional array, restoring the original characters inside quotes
Set goodRowList = CreateObject("System.Collections.ArrayList")
errorTemplate = "Error: Row #R has #U of #SU expected columns. "
errorTemplate = errorTemplate & IIf(removeErrorRows, "Row removed.", "Row kept with up to #SU columns.")
ReDim returnArray(0 To UBound(rowArray), 0 To rowSampleUBound)
On Error Resume Next 'If a row has insufficient columns, debug.print the error template but keep going
For r = 0 To UBound(returnArray, 1)
SplitRow = Split(rowArray(r), columnDelim)
rowUbound = UBound(SplitRow)
If rowUbound <> rowSampleUBound Then
Debug.Print Replace(Replace(Replace(errorTemplate, "#R", r), "#U", rowUbound), "#SU", rowSampleUBound)
ElseIf removeErrorRows Then 'Storing good rows to remove the rest at the end
goodRowList.Add r
End If
For c = 0 To rowSampleUBound
restoredValue = SplitRow(c)
restoredValue = Replace(restoredValue, rowDelimReplacement, rowDelim)
restoredValue = Replace(restoredValue, columnDelimReplacement, columnDelim)
returnArray(r, c) = restoredValue
Next
Next
On Error GoTo 0
'If removeErrorRows is set to true, this will remove the rows that were designated as having the wrong number of columns
If removeErrorRows Then
ReDim cleanArray(0 To goodRowList.Count - 1, 0 To rowSampleUBound)
For r = 0 To goodRowList.Count - 1
'Copy only the rows whose indices were stored in goodRowList
For c = 0 To rowSampleUBound
cleanArray(r, c) = returnArray(goodRowList(r), c)
Next
Next
returnArray = cleanArray
End If
CSVToArray = returnArray
End Function
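For example, a sketch of calling it on a whole file (the file path and the use of vbCrLf as the row delimiter are assumptions):
Dim fileText As String
Dim data As Variant
fileText = CreateObject("Scripting.FileSystemObject").OpenTextFile("C:\data\downloads.csv", 1).ReadAll
data = CSVToArray(fileText, vbCrLf, ",")
Debug.Print data(0, 0)   ' first cell of the first row; quoted fields keep their quote characters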
Upvotes: 0
Reputation: 3958
If the source CSV has every field in double quotes, then Split(strLine, """, """) may work well; you just need to strip the leading quote from the first field and the trailing quote from the last.
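A minimal sketch of that idea (the sample line is an assumption; this only works when every field is quoted and no field contains an escaped quote):
Dim strLine As String
Dim parts() As String
strLine = """1"",""2"",""This should,be one part"",""4"""
strLine = Mid$(strLine, 2, Len(strLine) - 2)   ' drop the outer quotes
parts = Split(strLine, """,""")                ' split on the quote-comma-quote sequence
' parts = 1 | 2 | This should,be one part | 4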
Upvotes: 0
Reputation: 1
Try this! The code creates the regex object with late binding via CreateObject, so ticking "Microsoft VBScript Regular Expressions 5.5" under Tools > References is optional (it's only needed if you switch to early binding).
Function Splitter(line As String, n As Integer)
Dim s() As String
Dim regex As Object
Set regex = CreateObject("vbscript.regexp")
regex.IgnoreCase = True
regex.Global = True
regex.Pattern = ",(?=([^\""]*\""[^\""]*\"")*[^\""]*$)"
s = split(regex.Replace(line, "|/||\|"), "|/||\|")
Splitter = s(n - 1)
End Function
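For example (the sample line is an assumption), asking for the fourth field returns the quoted value in one piece:
Debug.Print Splitter("1,2,3,""This should,be one part"",5,6,7", 4)
' prints: "This should,be one part" (the surrounding quotes are kept)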
Upvotes: 0
Reputation: 444
We had a similar CSV parsing challenge in Excel recently, and implemented a solution adapted from JavaScript code for parsing CSV data:
Function SplitCSV(csvText As String, delimiter As String) As String()
' Create a regular expression to parse the CSV values
' (requires a reference to "Microsoft VBScript Regular Expressions 5.5" for the early-bound RegExp type)
Dim RegEx As New RegExp
' Create pattern which will match each column in the CSV, with submatches for each of the groups in the regex
' Match Groups: Delimiter Quoted fields Standard fields
RegEx.Pattern = "(" + delimiter + "|^)(?:\""([^\""]*(?:\""\""[^\""]*)*)\""|([^\""\""" + delimiter + """]*))"
RegEx.Global = True
RegEx.IgnoreCase = True
' Create an array to hold all pattern matches (i.e. columns)
Dim Matches As MatchCollection
Set Matches = RegEx.Execute(csvText)
' Create an array to hold output data
Dim Output() As String
' Create int to track array location when iterating
Dim i As Integer
i = 0
' Manually add blank if first column is blank, since VBA regex misses this
If csvText Like ",*" Then
ReDim Preserve Output(i)
Output(i) = ""
i = i + 1
End If
' Iterate over all pattern matches and get values into output array
Dim Match As Match
Dim MatchedValue As String
For Each Match In Matches
' Check to see which kind of value we captured (quoted or unquoted)
If (Len(Match.SubMatches(1)) > 0) Then
' We found a quoted value. When we capture this value, unescape any double quotes
MatchedValue = Replace(Match.SubMatches(1), """""", """")
Else
' We found a non-quoted value
MatchedValue = Match.SubMatches(2)
End If
' Now that we have our value string, let's add it to the data array
ReDim Preserve Output(i)
Output(i) = MatchedValue
i = i + 1
Next Match
' Return the parsed data
SplitCSV = Output
End Function
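For reference, a possible call against the OP's sample line (a sketch, not from the original answer):
Dim fields() As String
Dim i As Integer
fields = SplitCSV("1,2,3,""This should,be one part"",5,6,7", ",")
For i = LBound(fields) To UBound(fields)
    Debug.Print fields(i)   ' the quoted field comes back without its surrounding quotes
Next i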
Upvotes: 1
Reputation: 53
I made another variant of a solution for parsing CSV files with "quoted" text strings that may contain delimiters, such as commas inside the double quotes. This method doesn't require regular expressions or any other add-ons, and it handles multiple commas between the quotes. Here is a subroutine for testing:
Sub SubstituteBetweenQuotesSub()
'In-string character replacement function by Maryan Hutsul 1/29/2019
Dim quote, quoteTwo As Integer
Dim oddEven As Integer
Dim i, counter As Integer
Dim byteArray() As Byte
'LineItems are lines of text read from CSV file, or any other text string
LineItems = ",,,2019NoApocalypse.ditamap,[email protected],Approver,""JC, ,Son"",Reviewer,[email protected],""God, All-Mighty,"",2019-01-29T08:47:29.290-05:00"
quote = 1
oddEven = 0
Do Until quote = 0
quote = InStr(quote, LineItems, Chr(34))
quoteTwo = InStr(quote + 1, LineItems, Chr(34))
oddEven = oddEven + 1
If oddEven Mod 2 = 1 And quote <> 0 Then
counter = 0
For i = quote To quoteTwo
byteArray = StrConv(LineItems, vbFromUnicode)
If i <> 0 Then
If byteArray(i - 1) = 44 Then '44 represents comma, can also do Chr(44)
counter = counter + 1
End If
End If
Next i
LineItems = Left(LineItems, quote - 1) & Replace(LineItems, ",", ";", quote, counter)
quote = quote + 1
ElseIf quote <> 0 Then
quote = quote + 1
End If
Loop
End Sub
Here is the function, to which you can pass lines from .csv, .txt or any other text files:
Function SubstituteBetweenQuotes(LineItems)
'In-string character replacement function by Maryan Hutsul 1/29/2019
'LineItems are lines of text read from CSV file, or any other text string
Dim quote, quoteTwo As Integer
Dim oddEven As Integer
Dim i, counter As Integer
Dim byteArray() As Byte
quote = 1
oddEven = 0
Do Until quote = 0
quote = InStr(quote, LineItems, Chr(34))
quoteTwo = InStr(quote + 1, LineItems, Chr(34))
oddEven = oddEven + 1
If oddEven Mod 2 = 1 And quote <> 0 Then
counter = 0
For i = quote To quoteTwo
byteArray = StrConv(LineItems, vbFromUnicode)
If i <> 0 Then
If byteArray(i - 1) = 44 Then '44 represents "," comma, can also do Chr(44)
counter = counter + 1
End If
End If
Next i
LineItems = Left(LineItems, quote - 1) & Replace(LineItems, ",", ";", quote, counter)
quote = quote + 1
ElseIf quote <> 0 Then
quote = quote + 1
End If
Loop
SubstituteBetweenQuotes = LineItems
End Function
And below is code for reading a CSV file using the function:
Dim fullFilePath As String
Dim i As Integer
'fullFilePath - full link to your input CSV file
Open fullFilePath For Input As #1
row_number = 0
column_number = 0
'EOF - End Of File (1) - file #1
Do Until EOF(1)
Line Input #1, LineFromFile
LineItems = Split(SubstituteBetweenQuotes(LineFromFile), ",")
For i = LBound(LineItems) To UBound(LineItems)
ActiveCell.Offset(row_number, i).Value = LineItems(i)
Next i
row_number = row_number + 1
Loop
Close #1
All delimiters and the replacement character may be modified to suit your needs. I hope this is useful, as I had quite a journey solving some problems with CSV imports.
Upvotes: 1
Reputation: 21
I know this is an old post, but I thought this might help others. It was plagiarized/revised from http://n3wt0n.com/blog/comma-separated-values-and-quoted-commas-in-vbscript/, but it works really well and is set up as a function that you can pass your input line to.
Function SplitCSVLineToArray(Line, RemoveQuotes) 'Pass it a line and whether or not to remove the quotes
ReplacementString = "#!#!#" 'Random String that we should never see in our file
LineLength = Len(Line)
InQuotes = False
NewLine = ""
For x = 1 to LineLength
CurrentCharacter = Mid(Line,x,1)
If CurrentCharacter = Chr(34) then
If InQuotes then
InQuotes = False
Else
InQuotes = True
End If
End If
If InQuotes Then
CurrentCharacter = Replace(CurrentCharacter, ",", ReplacementString)
End If
NewLine = NewLine & CurrentCharacter
Next
LineArray = split(NewLine,",")
For x = 0 to UBound(LineArray)
LineArray(x) = Replace(LineArray(x), ReplacementString, ",")
If RemoveQuotes = True then
LineArray(x) = Replace(LineArray(x), Chr(34), "")
End If
Next
SplitCSVLineToArray = LineArray
End Function
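For example (the sample line is an assumption):
Dim fields
Dim i
fields = SplitCSVLineToArray("1,2,3,""This should,be one part"",5,6,7", True)
For i = 0 To UBound(fields)
    Debug.Print fields(i)   ' the quoted field is returned as: This should,be one part
Next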
Upvotes: 2
Reputation: 31
I realize this is an old post, but I just bumped into it looking for a solution to the same problem the OP had, so the thread is still relevant.
To import data from a CSV, I add a query to a worksheet:
wksTarget.QueryTables.Add(Connection:=strConn, Destination:=wksTarget.Range("A1"))
then set the appropriate QueryTable parameters (e.g. Name, FieldNames, RefreshOnFileOpen, etc.).
QueryTables can handle various delimiters via TextFileCommaDelimiter, TextFileSemiColonDelimiter and others, and a number of other parameters (TextFilePlatform, TextFileTrailingMinusNumbers, TextFileColumnDataTypes, TextFileDecimalSeparator, TextFileStartRow, TextFileThousandsSeparator) handle source file idiosyncrasies.
Relevant to the OP, QueryTables also has a parameter designed to handle commas that are within double quotes: TextFileTextQualifier = xlTextQualifierDoubleQuote.
I find QueryTables much simpler than writing code to import the file, split/parse strings or use regular expressions.
All together, a sample code snippet would look something like this:
strConn = "TEXT;" & "C:\Desktop\SourceFile.CSV"
varDataTypes = Array(5, 1, 1, 1, 1, 1, 5, 5)
With wksTarget.QueryTables.Add(Connection:=strConn, _
Destination:=wksTarget.Range("A1"))
.Name = "ImportCSV"
.FieldNames = True
.RefreshOnFileOpen = False
.SaveData = True
.TextFilePlatform = xlMSDOS
.TextFileStartRow = 1
.TextFileParseType = xlDelimited
.TextFileCommaDelimiter = True
.TextFileTextQualifier = xlTextQualifierDoubleQuote
.TextFileColumnDataTypes = varDataTypes
.Refresh BackgroundQuery:=False
End With
I prefer to delete the QueryTable once the data is imported (wksTarget.QueryTables("ImportCSV").Delete), but I suppose it could be created just once and then simply refreshed if the source and destination for the data don't change.
Upvotes: 1
Reputation: 2925
@Gimp said...
The current answers do not contain enough detail.
I'm running into the same problem. Looking for more detail in this answer.
To elaborate on @MRAB's answer:
Function ParseCSV(FileName)
Dim Regex 'As VBScript_RegExp_55.RegExp
Dim MatchColl 'As VBScript_RegExp_55.MatchCollection
Dim Match 'As VBScript_RegExp_55.Match
Dim FS 'As Scripting.FileSystemObject
Dim Txt 'As Scripting.TextStream
Dim CSVLine
ReDim ToInsert(0)
Set FS = CreateObject("Scripting.FileSystemObject")
Set Txt = FS.OpenTextFile(FileName, 1, False, -2)
Set Regex = CreateObject("VBScript.RegExp")
Regex.Pattern = """[^""]*""|[^,]*" '<- MRAB's answer
Regex.Global = True
Do While Not Txt.AtEndOfStream
ReDim ToInsert(0)
CSVLine = Txt.ReadLine
For Each Match In Regex.Execute(CSVLine)
If Match.Length > 0 Then
ReDim Preserve ToInsert(UBound(ToInsert) + 1)
ToInsert(UBound(ToInsert) - 1) = Match.Value
End If
Next
InsertArrayIntoDatabase ToInsert
Loop
Txt.Close
End Function
You need to customize the InsertArrayIntoDatabase Sub for your own table. Mine has several text fields named f00, f01, etc...
Sub InsertArrayIntoDatabase(a())
Dim rs As DAO.Recordset
Dim i, n
Set rs = CurrentDb().TableDefs("tbl").OpenRecordset()
rs.AddNew
For i = LBound(a) To UBound(a)
n = "f" & Format(i, "00") 'fields in table are f00, f01, f02, etc..
rs.Fields(n) = a(i)
Next
rs.Update
End Sub
Note that instead of using CurrentDb() in InsertArrayIntoDatabase(), you should really use a global variable that gets set to the value of CurrentDb() before ParseCSV() runs, because calling CurrentDb() in a loop is very slow, especially on a very large file.
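A minimal sketch of that idea (the variable name and the wrapper Sub are assumptions, not part of the original answer):
' Module-level cache, set once before the import starts
Dim db As DAO.Database

Sub ImportFile(FileName)
    Set db = CurrentDb()   ' resolve the database reference a single time
    ParseCSV FileName
End Sub

' ...then inside InsertArrayIntoDatabase, open the recordset from the cached reference:
' Set rs = db.TableDefs("tbl").OpenRecordset()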
Upvotes: 12
Reputation: 1057
The first way to solve this problem is to look at the structure of the line from the CSV file (int,int,"String literal, will have at most one comma",etc.). A naive solution, assuming the line doesn't contain any semicolons, would be:
Function splitLine1(line As String) As String()
Dim temp() As String
'Splits the line in three. The string delimited by " will be at temp(1)
temp = Split(line, Chr(34)) 'chr(34) = "
'Replaces the commas in the numeric fields by semicolons
temp(0) = Replace(temp(0), ",", ";")
temp(2) = Replace(temp(2), ",", ";")
'Joins the temp array with quotes and then splits the result using the semicolons
splitLine1 = Split(Join(temp, Chr(34)), ";")
End Function
This function only solves this particular problem. Another way to do the job is using the regular expression object from VBScript.
Function splitLine2(line As String) As String()
Dim regex As Object
Set regex = CreateObject("vbscript.regexp")
regex.IgnoreCase = True
regex.Global = True
'This pattern matches only commas outside quotes
'Pattern = ",(?=([^"]*"[^"]*")*(?![^"]*"))"
regex.Pattern = ",(?=([^" & Chr(34) & "]*" & Chr(34) & "[^" & Chr(34) & "]*" & Chr(34) & ")*(?![^" & Chr(34) & "]*" & Chr(34) & "))"
'regex.Replace will replace the commas outside quotes with semicolons and then the
'Split function will split the result based on the semicolons
splitLine2 = Split(regex.Replace(line, ";"), ";")
End Function
This method seems much more cryptic, but does not depend on the structure of the line.
You can read more about regular expression patterns in VBScript here.
Upvotes: 17
Reputation: 91366
If you are working with MS Access tables, there are advantages in simply importing text from disk. For example:
''If you have a reference to Microsoft Scripting Runtime
Dim fs As New FileSystemObject
Dim ts As TextStream
''For late binding
''Dim fs As Object
''Dim ts As Object
''Set fs=CreateObject("Scripting.FileSystemObject")
Set ts = fs.CreateTextFile("z:\docs\import.csv", True)
sData = "1,2,3,""This should,be one part"",5,6,7"
ts.Write sData
ts.Close
''Just for testing, your table will already exist
''sSQL = "Create table Imports (f1 int, f2 int, f3 int, f4 text, " _
'' & "f5 int, f6 int, f7 int)"
''CurrentDb.Execute sSQL
''The fields will be called F1,F2 ... Fn in the text file
sSQL = "INSERT INTO Imports SELECT * FROM " _
& "[text;fmt=delimited;hdr=no;database=z:\docs\].[import.csv]"
CurrentDb.Execute sSQL
Upvotes: 3
Reputation: 5370
Taking your comments into account, you could take the easy way out here.
Upvotes: 0
Reputation: 20664
A simple regex for parsing a CSV line, assuming no quotes inside quoted fields, is:
"[^"]*"|[^,]*
Each non-empty match returns a field (the pattern also produces zero-length matches, which you should skip).
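A minimal late-bound sketch of that pattern in VBA (the zero-length check is needed because the second alternative can match an empty string):
Dim re As Object, m As Object
Set re = CreateObject("VBScript.RegExp")
re.Pattern = """[^""]*""|[^,]*"
re.Global = True
For Each m In re.Execute("1,2,3,""This should,be one part"",5,6,7")
    If m.Length > 0 Then Debug.Print m.Value   ' each non-empty match is one field
Next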
Upvotes: 12