Reputation:
I have a csv file of 1.2 million records of text. The alphanumeric fields are wrapped in quotation marks, the date/time or numeric fields are not.
For example "Fred","Smith",01/07/1967,2,"7, The High Street","Anytown","Anycounty","LS1 7AA"
What I want do is write some VBA in Excel (more or less the only tool available to me that I am reasonably proficient in the use of) that reads the CSV record by record, performs a check (as it happens on the last field, the post code) and then outputs a small subset of the 1.2m records to a new output file.
I understand how to open the two files, read the record, do what I need to do with the data and write it out (I will just output the input record with a prefix denoting an exception type)
What I don't know is how to parse the CSV in VBA properly. I can't do a simple text scan and search for commas as the text sometimes has commas in (hence why the text fields are text delimited)
Is there a fantastic command that would let me quicky get the data from the nth field in my record?
What I want is s_work = field(s_input_record,5) where 5 is the field number in my CSV....
Many thanks, C
Upvotes: 11
Views: 42074
Reputation: 119154
The following code should do the trick. I don't have Excel in front of me, so I haven't tested it, but the concept is sound.
If this ends up being too slow, we can look at ways to improve the efficiency.
Sub SelectSomeRecords()
Dim testLine As String
Open inputFileName For Input As #1
Open outputFileName For Output As #2
While Not EOF(1)
Line Input #1, testLine
If RecordIsInteresting(testLine) Then
Print #2, testLine
End If
Wend
Close #1
Close #2
End Sub
Function RecordIsInteresting(recordLine As String) As Boolean
Dim lineItems(1 to 8) As String
GetRecordItems(lineItems(), recordLine)
''// do your custom checking here:
RecordIsInteresting = lineItems(8) = "LS1 7AA"
End Function
Sub GetRecordItems(items() As String, recordLine as String)
Dim finishString as Boolean
Dim itemString as String
Dim itemIndex as Integer
Dim charIndex as Long
Dim inQuote as Boolean
Dim testChar as String
inQuote = False
charIndex = 1
itemIndex = 1
itemString = ""
finishString = False
While charIndex <= Len(recordLine)
testChar = Mid$(recordLine, charIndex, 1)
finishString = False
If inQuote Then
If testChar = Chr$(34) Then
inQuote = False
finishString = True
charIndex = charIndex + 1 ''// ignore the next comma
Else
itemString = itemString + testChar
End If
Else
If testChar = Chr$(34) Then
inQuote = True
ElseIf testChar = "," Then
finishString = True
Else
itemString = itemString + testChar
End If
End If
If finishString Then
items(itemIndex) = itemString
itemString = ""
itemIndex = itemIndex + 1
End If
charIndex = charIndex + 1
Wend
End Sub
Upvotes: 8
Reputation: 37655
I used the following derivative of the code given above to successfully open an arbitrary csv file from VBA in Excel.
Option Explicit
Public cn As Connection
Public Sub DoIt()
Dim strcon As String
Dim strsql As String
Dim rs As RecordsetSet cn = CreateObject("ADODB.Connection")
strcon = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\bin\HomePlanet\;" _
& "Extended Properties=""text;HDR=Yes;FMT=Delimited"";"cn.Open strcon
strsql = "SELECT * FROM astuname.csv "
Set rs = New ADODB.Recordset
rs.Open strsql, cn
DoEvents ' pause here to inspect objects and properties rs.Close
End Sub
The rs (recordset) has a collection of fields, with a Count property. Each field as a Type property.
You can reference the fields by sequence number ...
Debug.Print rs.Fields(rs.Fields.Count - 1).Type
Is this sufficient?
If not, post the first several rows of the input file and I'll take it the rest of the way.
Upvotes: 3
Reputation: 37655
Anything you can do a-row-at-a-time with vba in excel, you can do in access with vba; plus a lot more because it's a database rather than a spreadsheet. Is access unavailable to you?
It's a lot easier to deal with logical tables, records, and fields than logical worksheets, rows, and columns.
For input, why does the "/Data/Import External Data/Text/csv" not work? Is the input not truly portable csv?
Upvotes: 2
Reputation: 10679
Look at the Input #
statement in the Excel help
Sample usage would be:
Input #fnInput, s_Forename, s_Surname, dt_DOB, i_Something, s_Street, s_Town, s_County, s_Postcode
and then use the Write #
statement to write matching records out again
The only issue might be that the date format in the output will end up as #1967-07-01# but this format is unambiguous unlike 01/07/1967 which would represent 1st July in the UK and 7th January in the US. If you need to preserve the formatting of the date then write it out as a string:
s_DOB = Format(dt_DOB, "dd/mm/yyyy")
Upvotes: 2
Reputation: 52326
I'd suggest taking a look at the Regular Expression library (you should see it in "Tools...References" as "Microsoft VBScript Regular Expressions 5.5" or something very similar.
There are samples of both the Reg Exp and a fairly comprehensive character-by-character at this location: http://www.xbeat.net/vbspeed/c_ParseCSV.php. Note that the Regexp version is waaaay shorter!
Have fun...
Upvotes: 1
Reputation: 71969
This doesn't directly answer your question, but grep
(or one of the Windows equivalents) would really shine for this, e.g.,
grep -e <regex_filter> foo.csv > bar.csv
Upvotes: 4
Reputation: 91366
How about VBScript, though this would also work in Excel:
Set cn = CreateObject("ADODB.Connection")
'Note HDR=Yes, that is, first row contains field names '
'and FMT delimted, ie CSV '
strCon="Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\Docs\;" _
& "Extended Properties=""text;HDR=Yes;FMT=Delimited"";"
cn.open strcon
'You would not need delimiters ('') if last field is numeric: '
strSQL="SELECT FieldName1, FieldName2 INTO New.csv FROM Old.csv " _
& " WHERE LastFieldName='SomeTextValue'"
'Creates new csv file
cn.Execute strSQL
Upvotes: 8