Ashat Usbekof
Ashat Usbekof

Reputation: 35

How to extract specific words from text files into xls spreadsheet

I'm new in VBA. Before posting my question here,I have spent almost 3 days surfing Internet.

I have 300+ text files (text converted from PDF using OCR),from text file. I need to get all words that contain "alphabet" and "digits" (as example KT315A, KT-315-a, etc) along with source reference (txt file name).

What I need is

1.add "smart filter" that will copy only words that contains
"alphabets" and "digits"

  1. paste copied data to column A

  2. add reference file name to column B

I have found code below that can copy all data from text files into excel spreadsheet.

text files look like

"line from 252A-552A to ddddd, ,,, @,@,rrrr, 22 , ....kt3443 , fff,,,etc"

final result in xls should be

A | B

252A-552A | file1

kt3443 | file1

 Option Explicit


Const sPath = "C:\outp\" 'remember end backslash
Const delim = "," 'comma delimited text file - EDIT
'Const delim = vbTab  'for TAB delimited text files


Sub ImportMultipleTextFiles()
   Dim wb As Workbook
   Dim sFile As String
   Dim inputRow As Long

   RefreshSheet

   On Error Resume Next
   sFile = Dir(sPath & "*.txt")

   Do Until sFile = ""
      inputRow = Sheets("Temp").Range("A" & Rows.Count).End(xlUp).Row + 1

      'open the text file
'format=6 denotes a text file
      Set wb = Workbooks.Open(Filename:=sPath & sFile, _
         Format:=6, _
         Delimiter:=delim)

      'copy and paste
      wb.Sheets(1).Range("A1").CurrentRegion.Copy _
         Destination:=ThisWorkbook.Sheets("Temp").Range("A" & inputRow)
      wb.Close SaveChanges:=False

      'get next text file
      sFile = Dir()
   Loop

   Set wb = Nothing
End Sub


Sub RefreshSheet()
   'delete old sheet and add a new one
   On Error Resume Next

   Application.DisplayAlerts = False
      Sheets("Temp").Delete
   Application.DisplayAlerts = True

   Worksheets.Add
   ActiveSheet.Name = "Temp"

   On Error GoTo 0
End Sub

thanks!

Upvotes: 0

Views: 1633

Answers (1)

Ron Rosenfeld
Ron Rosenfeld

Reputation: 60474

It's a little tough to tell exactly what constitutes a word from your example. It clearly can contain characters other than letters and numbers (eg the dash), but some of the items have dots preceding, so it cannot be defined as being delimited by a space.

I defined a "word" as a string that

  • Starts with a letter or digit and ends with a letter or digit
  • Contains both letters and digits
    • Might also contain any other non-space characters except a comma

To do this, I first replaced all the commas with spaces, and then applied an appropriate regular expression. However, this might accept undesired strings, so you might need to be more specific in defining exactly what is a word.

Also, instead of reading the entire file into an Excel workbook, by using the FileSystemObject we can process one line at a time, without reading 300 files into Excel. The base folder is set, as you did, by a constant in the VBA code.

But there are other ways to do this.

Be sure to set the references for early binding as noted in the code:


Option Explicit
'Set References to:
'  Microsoft Scripting Runtime
'  Microsoft VBscript Regular Expressions 5.5

Sub SearchMultipleTextFiles()
    Dim FSO As FileSystemObject
    Dim TS As TextStream, FO As Folder, FI As File, FIs As Files
    Dim RE As RegExp, MC As MatchCollection, M As Match

    Dim WS As Worksheet, RW As Long

    Const sPath As String = "C:\Users\Ron\Desktop"

Set FSO = New FileSystemObject
Set FO = FSO.GetFolder(sPath)

Set WS = ActiveSheet
    WS.Columns.Clear

Set RE = New RegExp
With RE
    .Global = True
    .Pattern = "(?:\d(?=\S*[a-z])|[a-z](?=\S*\d))+\S*[a-z\d]"
    .IgnoreCase = True
End With

For Each FI In FO.Files
    If FI.Name Like "*.txt" Then
        Set TS = FI.OpenAsTextStream(ForReading)
        Do Until TS.AtEndOfStream

            'Change .ReadLine to .ReadAll *might* make this run faster
            ' but would need to be tested.
            Set MC = RE.Execute(Replace(TS.ReadLine, ",", " "))
            If MC.Count > 0 Then
                For Each M In MC
                    RW = RW + 1
                    WS.Cells(RW, 1) = M
                    WS.Cells(RW, 2) = FI.Name
                Next M
            End If
        Loop
    End If
Next FI

End Sub

Upvotes: 2

Related Questions