HelpingHand

Reputation: 209

PowerShell - fast way of searching files based on rules

I'm mainly looking for some pointers and a little bit of code. My task is to search a number of files for different strings and create a log of the matches.

Initially I parsed each file looking for a single string, but that became too slow once I had thousands of files of around 1 MB each. I would therefore like to open each file once and scan it for multiple strings, attributing each match in the log to the rule that produced it.

I have created the following rules file:

{"Logs": {
   "Component":
   {
     "Files":[
       {
         "name": "test.txt",
         "encoding": "UTF8",
         "rules":[{
           "Rule1":"this is text"
           }]
       },
       {
         "name": "test2.txt",
         "encoding": "UTF8",
         "rules":[{
            "Rule2": "this is text1",
            "Rule3": "this is text3"
            }]
       }
     ]
   }
}}

Maybe that needs to be improved; it can be changed. The following PowerShell uses the rules to search through the files:

Function ParseFile($Files){
write-host "Parsing file" $Files.Name "for text " $Files.rules

 Get-ChildItem "." -Recurse -Filter $Files.Name | 
   Foreach-Object {
     write-host $_.FullName

     Foreach($line in Get-Content $_.FullName -encoding $Files.encoding ) {

     ##Check if the current line from file matches a rule from the $Files.Rules array.
     ##If so log the file, line and rule ID to a CSV file. E.g.:
     ##RuleID, RuleString, LineFromFile, FileName

     }
   } 
}

$JSON = Get-Content -Raw -Path rule.json | ConvertFrom-Json

foreach ($files in $JSON.Logs.Component.Files  ){
  write-host $files.name
  write-host "============================="
  ParseFile $files
}

Does the above make sense as the quickest way to search and classify? I'm not quite sure how to approach the commented section. I assume something like $line -in $Files.rules, but I don't think the array is quite right for this.

Any suggestions welcome and thanks in advance.

Upvotes: 1

Views: 1016

Answers (2)

Frode F.

Reputation: 54911

Here's an alternative using regex. I modified the JSON to make it easier to parse. The original JSON can still be used if needed: read RuleID and RuleString from the Name and Value of each property in $_.rules.psobject.Properties.

This solution requires each RuleID to be a single word, because it is used as the name of a regex capture group.

rules.json

{"Logs": {
    "Component":
    {
        "Files":[
        {
            "name": "test.txt",
            "encoding": "UTF8",
            "rules":[{
                "RuleID": "Rule1",
                "Rule": "this is text"
            }]
        },
        {
            "name": "test2.txt",
            "encoding": "UTF8",
            "rules":[
            {
                "RuleID": "Rule2",
                "Rule": "this is text1"
            },
            {
                "RuleID": "Rule3",
                "Rule": "this is text3"
            }
            ]
        }
        ]
    }
}}

Code:

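#Load the rules (the modified rules.json shown above)
$JSON = Get-Content -Raw -Path rules.json | ConvertFrom-Json

$JSON.Logs.Component.Files | ForEach-Object {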
    $item = $_

    #Create regex-pattern
    $pattern = ($item.rules | ForEach-Object { "(?'$($_.RuleID)'$([regex]::Escape($_.Rule)))" }) -join '|'

    #Find matching files
    Get-ChildItem -Path "." -Recurse -Filter $item.Name |
    Select-String -Pattern $pattern -Encoding $item.Encoding -AllMatches |
    ForEach-Object {
        $hit = $_

        #-AllMatches can return several rules for one line, so emit one row per matched named group
        $hit.Matches | ForEach-Object { $_.Groups } | Where-Object { $_.Name -ne '0' -and $_.Success } | ForEach-Object {

            New-Object -TypeName psobject -Property @{
                RuleID = $_.Name
                RuleString = $_.Value
                LineFromFile = $hit.Line
                FileName = $hit.Path
            }

        }

    }
} | Export-Csv -Path results.csv -NoTypeInformation -Encoding UTF8

results.csv:

"FileName","LineFromFile","RuleID","RuleString"
"D:\New folder\test.txt","foo this is text1 bar","Rule1","this is text"
"D:\New folder\test.txt","this is text3ss","Rule1","this is text"
"D:\New folder\test2.txt","foo this is text1 bar","Rule2","this is text1"
"D:\New folder\Test\test2.txt","this is text3ss","Rule3","this is text3"
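
If the original JSON layout from the question has to be kept, the property names and values mentioned at the top of this answer could be read roughly like this. This is only a sketch, assuming the question's structure where each rules entry is a single object whose property names are the rule IDs; the inline sample data and the variable names $item and $rules are illustrative, not part of the original script.

```powershell
# Illustrative copy of one "Files" entry in the question's original layout
$json = @'
{ "name": "test2.txt", "encoding": "UTF8",
  "rules": [ { "Rule2": "this is text1", "Rule3": "this is text3" } ] }
'@
$item = $json | ConvertFrom-Json

# Flatten every property of the rules object into a RuleID/Rule pair
$rules = $item.rules | ForEach-Object { $_.psobject.Properties } | ForEach-Object {
    [pscustomobject]@{ RuleID = $_.Name; Rule = $_.Value }
}

# Build the same named-group pattern as in the code above
$pattern = ($rules | ForEach-Object { "(?'$($_.RuleID)'$([regex]::Escape($_.Rule)))" }) -join '|'
$pattern
```

The resulting $pattern can be passed to Select-String in the same way as in the code above.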

Upvotes: 2

boxdog

Reputation: 8442

I adjusted your JSON slightly, so that each file's rules are a plain array of strings:

{"Logs": {
   "Component":
   {
     "Files":[
       {
         "name": "test.txt",
         "encoding": "UTF8",
         "rules":["this is text"
         ]
       },
       {
         "name": "test2.txt",
         "encoding": "UTF8",
         "rules":["this is text1",
          "this is text3"
         ]
       }
     ]
   }
}}

Using this, here is a possible solution. The rule number it logs is the 1-based position of the rule within that file's rules array:

$JSON = Get-Content -Raw -Path rules.json | ConvertFrom-Json

$JSON.Logs.Component.Files |
    ForEach-Object {
        $fileName = $_.Name
        $rules = $_.rules

        Get-Content $fileName -encoding $_.encoding |
            ForEach-Object {
                for($i=0;$i -lt $rules.Count;$i++)
                {
                    if($_ -like "*$($rules[$i])*")
                    {
                        [PsCustomObject]@{RuleNumber = ($i+1); 
                                          RuleString = $rules[$i];
                                          MatchingText = $_;
                                          File = $filename} | 
                            Export-Csv matches.csv -Append -NoTypeInformation
                    }
                }
            }
    }
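
Two details of the loop above may matter on large inputs: -like treats characters such as * and [ inside a rule as wildcards, and Export-Csv -Append is invoked once per match. The following is a sketch of one possible variation, not the answer's original method. It assumes that literal, case-sensitive matching is acceptable (String.Contains is case-sensitive, unlike -like), swaps Get-Content for [System.IO.File]::ReadLines, and writes the CSV once at the end. The temp-file paths and sample lines are made up for illustration.

```powershell
# Illustrative input: a temp file plus a rules array shaped like the adjusted JSON
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'test2-sample.txt'
Set-Content -Path $path -Value 'foo this is text1 bar', 'nothing here', 'this is text3ss' -Encoding UTF8
$rules = @('this is text1', 'this is text3')

# Collect all matches first; .Contains() is a literal, case-sensitive substring test
$results = foreach ($line in [System.IO.File]::ReadLines($path)) {
    for ($i = 0; $i -lt $rules.Count; $i++) {
        if ($line.Contains($rules[$i])) {
            [pscustomobject]@{
                RuleNumber   = $i + 1
                RuleString   = $rules[$i]
                MatchingText = $line
                File         = $path
            }
        }
    }
}

# Write the CSV once instead of appending for every match
$csv = Join-Path ([System.IO.Path]::GetTempPath()) 'matches-sample.csv'
$results | Export-Csv -Path $csv -NoTypeInformation -Encoding UTF8
```

Whether this is actually faster depends on the files and the number of matches, so it is worth timing both versions with Measure-Command on real data.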

Upvotes: 1
