Reputation: 81
I made a tool that parses huge logs (500 MB+) and finds and extracts data from them. I use combinations of -split, -match, -notmatch, and -replace inside all the loops to do the processing as quickly as possible.
So we have a log with a hundred thousand lines (like the one below), and I want to parse it (edit: there can be multiple instances of the needed data in one single line):
2017-01-20 [101] DEBUG IO.EXAMPLE - (F1ссроCd8w) Method: (EXAMPLE) xxx from host 127.0.0.1 with {"Point":"\"Phone1\"","currentVersionNumber":"\"5.5.5.5\"","sku":"87FF"},{"Point2":"\"Phone14\"","currentVersionNumber":"\"5.5.5.5\"","sku":"87LLF"}
The output will be like this (without bullets):
87FF
87LLF
First script that uses ReadLines and -Value of Set-Content:
Second script, with StreamReader and StreamWriter:
#ReadLines and use of -Value of Set-Content to process lines inside of it
Measure-Command -Expression {
    $tocount = [System.IO.File]::ReadLines($file)
    $a = foreach ($s in $tocount) {
        $s -split "," -match "sku" -split ":" -notmatch "sku" -replace '[^A-Za-z0-9]'
    }
    Set-Content "$HOME\Documents\File(Read)Set-Content(Value).txt" -Value ($a) -Encoding UTF8
} | Select-Object @{n = "Elapsed"; e = { $_.Minutes, "Minutes", $_.Seconds, "Seconds", $_.Milliseconds, "Milliseconds" -join " " } }
#StreamReader and StreamWriter with while and ForEach-Object
Measure-Command -Expression {
    $reader = New-Object System.IO.StreamReader($file)
    $sw = New-Object System.IO.StreamWriter("$HOME\Documents\Stream(Read-Write)LinesWhile.txt", $true)
    while ($null -ne ($line = $reader.ReadLine())) {
        $line -split "," -match "sku" -split ":" -notmatch "sku" -replace '[^A-Za-z0-9]' | ForEach-Object { $sw.WriteLine($_) }
    }
    $reader.Close()
    $sw.Close()
} | Select-Object @{n = "Elapsed"; e = { $_.Minutes, "Minutes", $_.Seconds, "Seconds", $_.Milliseconds, "Milliseconds" -join " " } }
Why is the first option faster than the fourth, which is considered to be the quickest? Do you know any other ways to do this quickly?
Upvotes: 1
Views: 838
Reputation: 34421
I would use regex to parse the lines into a list of class objects and then search the list. I saved your line 5 million times to a file and then used the code below to parse it. It took 1 minute 36 seconds, which seems to be faster than your code, and I'm parsing all the fields. It took only 0.26 seconds to run the query after the data was parsed. The file size is 849,610 KB.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;

namespace ConsoleApplication1
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.txt";

        static void Main(string[] args)
        {
            List<Data> data = new List<Data>();
            StreamReader reader = new StreamReader(FILENAME);
            string line = "";
            while ((line = reader.ReadLine()) != null)
            {
                data.Add(new Data(line));
            }
            reader.Close();

            var results = data.Where(x => x.sku == "87FF");
            int count = results.Count();
        }
    }

    public class Data
    {
        const string pattern1 = @"^(?'date'[^\s]+)\s+\[(?'id'\d+)\](?'description'[^-]+)-\s+\((?'code'[^\)]+)\)\s+Method:\s+(?'method'.*)from host\s+(?'host'[^\s]+)";
        const string pattern2 = @"""(?'key'[^""]+)"":""(\\""(?'value'[^\\]+)|(?'value'[^""]+))";

        public DateTime date { get; set; }
        public int id { get; set; }
        public string description { get; set; }
        public string code { get; set; }
        public string method { get; set; }
        public string host { get; set; }
        public string point { get; set; }
        public string currentVersionNumber { get; set; }
        public string sku { get; set; }

        public Data() { }

        public Data(string line)
        {
            Match match = Regex.Match(line, pattern1);
            date = DateTime.Parse(match.Groups["date"].Value);
            id = int.Parse(match.Groups["id"].Value);
            description = match.Groups["description"].Value.Trim();
            code = match.Groups["code"].Value;
            method = match.Groups["method"].Value;
            host = match.Groups["host"].Value;

            MatchCollection matches = Regex.Matches(line, pattern2);
            foreach (Match m in matches)
            {
                string key = m.Groups["key"].Value;
                string value = m.Groups["value"].Value;
                switch (key)
                {
                    case "Point":
                        point = value;
                        break;
                    case "currentVersionNumber":
                        currentVersionNumber = value;
                        break;
                    case "sku":
                        sku = value;
                        break;
                    default:
                        break;
                }
            }
        }
    }
}
Upvotes: 0
Reputation: 437052
You can improve your PowerShell code's performance as follows:
Use a switch statement with the -Regex option for fast line-by-line processing based on regexes, utilizing only a single regex operation per input line.
Use a System.IO.StreamWriter instance to write to the target file.
Caveat: The solutions below assume that only one sku property value is present on each input line - switch -Regex behaves like the -match operator, which finds at most one match in the input string - see the bottom section for a solution that captures all matches per line.
$sw = [System.IO.StreamWriter]::new("$HOME\Documents\StreamWrite.txt")
switch -Regex -File $file {
    '"sku":"([^"]+)' { $sw.WriteLine($Matches[1]) }
}
$sw.Close()
Note: The regex assumes that the format of the sample line is strictly adhered to. However, if variations in whitespace can occur (e.g., "sku":"87FF" vs. "sku": "87FF"), the regex needs to account for that: '"sku":\s*"([^"]+)'
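To illustrate, a whitespace-tolerant variant of the switch approach could look like this (a sketch assuming the same $file variable and output path as above):

```powershell
# Sketch: same streaming switch -Regex approach, but \s* after the colon
# tolerates both "sku":"87FF" and "sku": "87FF".
$sw = [System.IO.StreamWriter]::new("$HOME\Documents\StreamWrite.txt")
switch -Regex -File $file {
    '"sku":\s*"([^"]+)' { $sw.WriteLine($Matches[1]) }
}
$sw.Close()
```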
The above uses streaming processing (processing one line at a time), which avoids having to read the entire file into memory at once.
If you don't mind reading the whole file at once, you can simplify the command to use a single Set-Content call on the collected-in-memory output lines for writing the output file, but note that my tests show that it won't be faster:
Set-Content $outFile -Value $(
    switch -Regex -File $file {
        '"sku":"([^"]+)' { $Matches[1] }
    }
)
Note: For best performance, the output lines are passed via the -Value parameter to Set-Content at once, as a single array; if you were to use the pipeline instead ($(switch ...) | Set-Content $outFile), the command would be much slower, because the lines would pass through the pipeline one by one.
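The difference can be measured directly. A minimal sketch, assuming $file and $outFile are already defined:

```powershell
# All-at-once: collect every match in memory, then write in one -Value call.
(Measure-Command {
    Set-Content $outFile -Value $(
        switch -Regex -File $file { '"sku":"([^"]+)' { $Matches[1] } }
    )
}).TotalSeconds

# One-by-one: each match travels through the pipeline individually - much slower.
(Measure-Command {
    $(switch -Regex -File $file { '"sku":"([^"]+)' { $Matches[1] } }) |
        Set-Content $outFile
}).TotalSeconds
```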
To capture all values that match the regex on a given line, use the following approach:
$sw = [System.IO.StreamWriter]::new("$HOME\Documents\StreamWrite.txt")
# Create a precompiled, case-sensitive regex.
$re = [regex]::new('(?<="sku":")[^"]+', 'Compiled')
switch -File $file {
    default {
        foreach ($val in $re.Matches($_).Value) {
            $sw.WriteLine($val)
        }
    }
}
$sw.Close()
Upvotes: 2
Reputation: 61013
I guess using switch -Regex -File might be the fastest, but you will have to measure this on your test log(s) yourself.
$result = switch -Regex -File $file {
    '"sku"\s?:\s?"([a-z0-9]+)"' { $Matches[1] }
}
# Double quotes are required here so that $HOME is expanded.
$result | Set-Content -Path "$HOME\Documents\File(switch).txt" -Encoding UTF8
Upvotes: 0
Reputation: 26
You can use Get-ChildItem *.log | Select-String -Pattern; check if it helps with performance.
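For example, a sketch of that approach (the file paths and the sku regex are assumptions based on the question; -AllMatches captures multiple sku occurrences per line):

```powershell
# Sketch: let Select-String scan the log files and emit only the captured sku values.
Get-ChildItem "$HOME\Documents\*.log" |
    Select-String -Pattern '"sku":"([^"]+)"' -AllMatches |
    ForEach-Object { foreach ($m in $_.Matches) { $m.Groups[1].Value } } |
    Set-Content "$HOME\Documents\File(SelectString).txt" -Encoding UTF8
```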
Upvotes: 0