Reputation: 1289
I have built a simple webscraper in VBA that extracts a table from Google Patents and stores the innerHTML in a .txt file (named something like 1234567.txt) for about 23,000 patents. Now I want to analyze the content of each file. To do so, I hope to import the txt files into VBA so that I can do some string searches but that seems to be ludicrously difficult. I read about 20 solutions to import .txt files into VBA but none work for my files that look like this:
US6824791 B2 ' There is a shift + Enter here
<TD class="patent-data-table-td citation-patent"><A href="/patents/US7767249">US7767249</A></TD>
<TD class="patent-data-table-td patent-date-value">Jul 25, 2005</TD>
<TD class="patent-data-table-td patent-date-value">Aug 3, 2010</TD>
<TD class="patent-data-table-td ">Hewlett-Packard Development Company, L.P.</TD>
<TD class="patent-data-table-td ">Preparation of nanoparticles</TD></TR>
<TR>
<TD class="patent-data-table-td citation-patent"><A href="/patents/US7935853">US7935853</A><SPAN class=patent-tooltip-anchor aria-label="Cited by examiner" data-tooltip-text="Cited by examiner" data-tooltip="Cited by examiner" a="null"> *</SPAN></TD>
<TD class="patent-data-table-td patent-date-value">Oct 8, 2009</TD>
<TD class="patent-data-table-td patent-date-value">May 3, 2011</TD>
<TD class="patent-data-table-td ">Bobelium S.L.</TD>
<TD class="patent-data-table-td ">Micronized composition of a 2,4-disubstituted phenol derivative</TD></TR>
<TR>
<TD class="patent-data-table-td citation-patent"><A href="/patents/US8524829">US8524829</A></TD>
<TD class="patent-data-table-td patent-date-value">Jun 17, 2008</TD>
<TD class="patent-data-table-td patent-date-value">Sep 3, 2013</TD>
<TD class="patent-data-table-td ">Brown University Research Foundation</TD>
<TD class="patent-data-table-td ">Methods for micronization of hydrophobic drugs</TD></TR>
<TR>
<TD class="patent-data-table-td citation-patent"><A href="/patents/EP2422804A1?cl=en">EP2422804A1</A></TD>
<TD class="patent-data-table-td patent-date-value">Jun 16, 2005</TD>
<TD class="patent-data-table-td patent-date-value">Feb 29, 2012</TD>
<TD class="patent-data-table-td ">Amano Enzyme USA., Ltd.</TD>
<TD class="patent-data-table-td ">Controlled release formulations of enzymes, microorganisms, and antibodies with mucoadhesive polymers</TD></TR></TBODY></TABLE>
' There is a shift + Enter here
So despite the fact that the files are neatly structured and repetitive, it seems to be very difficult to import them as a single string. I basically want to loop through the file, and extract the patent number and the 2 dates mentioned (using Mid
and InStr
) and put them in three different columns. That's what I think is best but I'm keen to hear smarter suggestions!
It might be relevant to know that every .txt file has a different length (number of rows), but I can estimate the exact number of rows with 99.9% certainty (1 (title) + 6 * total number of citations (which I know) - 1 (last <TR> is missing)
Thanks in advance
Simon
EDIT: some examples of things I tried. This codes were taken from online sources. I am not really sure how they should work and it is very possible that I have applied them incorrectly.
`Sub Text2Excel_Click()
Dim sourcestring as String
sourcestring = GetText("C\users\...\test.txt")
sourcestring = OpenTextFileToString("C\users\...\test.txt")
Function GetText(sFile As String) As String
Dim sText As String
Dim nSourceFile As Integer
''Close any open text files
Close
''Get the number of the next free text file
nSourceFile = FreeFile
''Write the entire file to sText
Open sFile For Input As #nSourceFile
sText = Input$(LOF(1), 1)
Close
GetText = sText
End Function
Function OpenTextFileToString(ByVal strFile As String) As String
' RB Smissaert - Author
Dim hFile As Long
hFile = FreeFile
Open strFile For Input As #hFile
OpenTextFileToString = Input$(LOF(hFile), hFile)
Close #hFile
End Function
I also tried the following suggestion: but that gives me the runtime error 424 "object required".
Upvotes: 0
Views: 380
Reputation: 3183
If you are just trying to get the text file into 1 string in vba then you can use the following function.
NOTE: This requires you to add a reference to Microsoft Scripting Runtime.
Public Function ReadFileAsString(ByVal filePath As String) As String
Dim fso As FileSystemObject
Set fso = New FileSystemObject
Set txtstream = fso.OpenTextFile(filePath, ForReading, False)
Dim sourceString As String
Do While Not txtstream.AtEndOfStream
sourceString = sourceString + txtstream.ReadLine
Loop
ReadFileAsString = sourceString
txtstream.Close
Set fso = Nothing
Set txtstream = Nothing
End Function
Then you can do InStr() and Mid() etc on the string the above function returns. To see the text file try:
Public Sub Test()
Debug.Print ReadFileAsString("C:\Users\ausername\Desktop\test.txt")
End Sub
Upvotes: 1