Reputation: 50712
I have a list of regular expressions (about 2000) and over a million HTML files. I want to check whether each regular expression matches each file. How can I do this in PowerShell?
Performance is important, so I don't want to loop through regular expressions.
I tried:
$text | Select-String -Pattern pattern1, pattern2,...
It returns all the matches, but I also want to find out which patterns matched and which did not. I need to build a list of the matching regular expressions for each file.
Upvotes: 2
Views: 812
Reputation: 54881
You could try something like this:
$regex = "^test","e2$" # Or use (Get-Content <path to your regex file>)
$ht = @{}
# Adjust Get-ChildItem to your criteria (filter, path, recurse, etc.)
Get-ChildItem -Filter *.txt | Select-String -Pattern $regex | ForEach-Object {
    # Append the pattern that matched to the list for this file
    $ht[$_.Path] += @($_ | Select-Object -ExpandProperty Pattern)
}
Test output:
$ht | Format-Table -AutoSize
Name                                               Value
----                                               -----
C:\Users\graimer\Desktop\New Text Document (2).txt {e2$}
C:\Users\graimer\Desktop\New Text Document.txt     {^test, e2$}
You didn't specify how you wanted the output.
UPDATE: To catch several patterns that match the same line, run each pattern separately, like this (mjolinor's answer is probably faster than this):
$regex = "^test","e2$" # Or use (Get-Content <path to your regex file>)
$ht = @{}
# Adjust Get-ChildItem to your criteria (filter, path, recurse, etc.)
$regex | ForEach-Object {
    $pattern = $_
    # Scan all files once per pattern
    Get-ChildItem -Filter *.txt | Select-String -Pattern $pattern | ForEach-Object {
        $ht[$_.Path] += @($_ | Select-Object -ExpandProperty Pattern)
    }
}
UPDATE2: I don't have enough samples to test it, but since you have such a huge number of files, you might want to try reading each file into memory before looping through the patterns. It may be faster.
$regex = "^test","e2$" # Or use (Get-Content <path to your regex file>)
$ht = @{}
# Adjust Get-ChildItem to your criteria (filter, path, recurse, etc.)
Get-ChildItem -Filter *.txt | ForEach-Object {
    # Read the file once and reuse its lines for every pattern
    $text = $_ | Get-Content
    $filename = $_.FullName
    $regex | ForEach-Object {
        $text | Select-String -Pattern $_ | ForEach-Object {
            $ht[$filename] += @($_ | Select-Object -ExpandProperty Pattern)
        }
    }
}
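If reading each file only once does help, a further variation (untested at your scale, so treat it as a sketch) is to build the [regex] objects once up front and stop at the first matching line per pattern, so each pattern is recorded at most once per file. The function name Get-MatchingPattern and its parameters are my own invention for illustration. IgnoreCase is set because Select-String is case-insensitive by default.

```powershell
# Hypothetical helper; the name and parameters are illustrative only.
function Get-MatchingPattern {
    param(
        [string[]]$Path,      # full paths of the files to scan
        [string[]]$Pattern    # regular expressions to test
    )

    # Build each regex object once. IgnoreCase mirrors Select-String's default.
    $options = [System.Text.RegularExpressions.RegexOptions]'IgnoreCase, Compiled'
    $compiled = foreach ($p in $Pattern) {
        New-Object System.Text.RegularExpressions.Regex -ArgumentList $p, $options
    }

    $result = @{}
    foreach ($file in $Path) {
        # Read the file once; .NET needs a full path here, not a relative one.
        $lines = [System.IO.File]::ReadAllLines($file)
        $hits = New-Object System.Collections.ArrayList
        foreach ($re in $compiled) {
            foreach ($line in $lines) {
                if ($re.IsMatch($line)) {
                    [void]$hits.Add($re.ToString())   # ToString() returns the pattern text
                    break                              # one hit is enough for this pattern
                }
            }
        }
        $result[$file] = $hits
    }
    $result
}
```

You would call it with something like Get-MatchingPattern -Path (Get-ChildItem -Filter *.txt | Select-Object -ExpandProperty FullName) -Pattern $regex. It still loops over the patterns, it just avoids re-reading the file and re-parsing the pattern each time.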
Upvotes: 2
Reputation: 68263
I don't see any way around doing a foreach through the regex collection.
This is the best I could come up with performance-wise:
$regexes = 'pattern1','pattern2'
$files = Get-ChildItem -Path <file path> |
    Select-Object -ExpandProperty FullName
$ht = @{}
foreach ($file in $files)
{
    $ht[$file] = New-Object collections.arraylist
    foreach ($regex in $regexes)
    {
        # -Quiet returns $true at the first match instead of reading the whole file
        if (Select-String $regex $file -Quiet)
        {
            [void]$ht[$file].add($regex)
        }
    }
}
$ht
You could speed up the process by using background jobs and dividing up the file collection among the jobs.
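A rough sketch of the background-job idea, assuming the file list holds full paths (a job does not necessarily start in your current directory). The function name Find-PatternWithJob and the default of 4 jobs are assumptions made for illustration; tune the job count to your machine.

```powershell
# Hypothetical helper; the name, parameters, and default job count are illustrative only.
function Find-PatternWithJob {
    param(
        [string[]]$Path,       # full paths of the files to scan
        [string[]]$Pattern,    # regular expressions to test
        [int]$JobCount = 4     # how many background jobs to spread the files over
    )

    if ($Path.Count -eq 0) { return @{} }

    # Split the file list into roughly equal chunks, one job per chunk.
    $size = [int][Math]::Ceiling($Path.Count / $JobCount)
    $jobs = for ($i = 0; $i -lt $Path.Count; $i += $size) {
        $end = [Math]::Min($i + $size, $Path.Count) - 1
        $chunk = $Path[$i..$end]
        Start-Job -ArgumentList $chunk, $Pattern -ScriptBlock {
            param($files, $regexes)
            foreach ($file in $files) {
                foreach ($regex in $regexes) {
                    if (Select-String -Pattern $regex -Path $file -Quiet) {
                        # Emit one record per (file, pattern) hit.
                        New-Object psobject -Property @{ Path = $file; Pattern = $regex }
                    }
                }
            }
        }
    }

    # Merge the records from all jobs into one hashtable keyed by file path.
    $result = @{}
    $jobs | Wait-Job | Receive-Job | ForEach-Object {
        if (-not $result.ContainsKey($_.Path)) {
            $result[$_.Path] = New-Object System.Collections.ArrayList
        }
        [void]$result[$_.Path].Add($_.Pattern)
    }
    $jobs | Remove-Job
    $result
}
```

Unlike the loop above, files with no matching pattern do not get an entry in the result here. Each job is a separate process, so this only pays off when every chunk holds a lot of files.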
Upvotes: 1