Reputation: 81
I made a tool that parses huge logs (500 MB+) and finds and extracts data from them. I use combinations of -split, -match, -notmatch, and -replace inside all the loops to do the processing as quickly as possible.
So we have a log with a hundred thousand lines (like the one below), and I want to parse it (edit: there can be multiple instances of the needed data in one single line):
2017-01-20 [101] DEBUG IO.EXAMPLE - (F1ссроCd8w) Method: (EXAMPLE) xxx from host 127.0.0.1 with {"Point":"\"Phone1\"","currentVersionNumber":"\"5.5.5.5\"","sku":"87FF"},{"Point2":"\"Phone14\"","currentVersionNumber":"\"5.5.5.5\"","sku":"87LLF"}
The output will be like this (without bullets):
87FF
87LLF
First script that uses ReadLines and -Value of Set-Content:
Second script, with StreamReader and StreamWriter:
#ReadLines and use of -Value of Set-Content to process lines inside of it
Measure-Command -Expression {
    $tocount = [System.IO.File]::ReadLines($file)
    $a = foreach ($s in $tocount) {
        $s -split "," -match "sku" -split ":" -notmatch "sku" -replace '[^A-Za-z0-9]'
    }
    Set-Content "$HOME\Documents\File(Read)Set-Content(Value).txt" -Value ($a) -Encoding UTF8
} | Select-Object @{n = "Elapsed"; e = { $_.Minutes, "Minutes", $_.Seconds, "Seconds", $_.Milliseconds, "Milliseconds" -join " " } }
#StreamReader and StreamWriter with while and ForEach-Object
Measure-Command -Expression {
    $reader = New-Object System.IO.StreamReader($file)
    $sw = New-Object System.IO.StreamWriter("$HOME\Documents\Stream(Read-Write)LinesWhile.txt", $true)
    while ($null -ne ($line = $reader.ReadLine())) {
        $line -split "," -match "sku" -split ":" -notmatch "sku" -replace '[^A-Za-z0-9]' | ForEach-Object { $sw.WriteLine($_) }
    }
    $reader.Close()
    $sw.Close()
} | Select-Object @{n = "Elapsed"; e = { $_.Minutes, "Minutes", $_.Seconds, "Seconds", $_.Milliseconds, "Milliseconds" -join " " } }
Why is the first option faster than the fourth, which is considered to be the quickest? Do you know any other ways to do this quickly?
Upvotes: 1
Views: 838
Reputation: 34421
I would use regex to parse the lines into a list of class objects and then search the list. I saved your line 5 million times to a file and then used the code below to parse it. It took 1 minute 36 seconds, which seems to be faster than your code, and I'm parsing all the fields. It took only 0.26 seconds to run the query after the data was parsed. The file size is 849,610 KB.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;

namespace ConsoleApplication1
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.txt";

        static void Main(string[] args)
        {
            List<Data> data = new List<Data>();
            StreamReader reader = new StreamReader(FILENAME);
            string line = "";
            while ((line = reader.ReadLine()) != null)
            {
                data.Add(new Data(line));
            }
            reader.Close();

            var results = data.Where(x => x.sku == "87FF");
            int count = results.Count();
        }
    }

    public class Data
    {
        const string pattern1 = @"^(?'date'[^\s]+)\s+\[(?'id'\d+)\](?'description'[^-]+)-\s+\((?'code'[^\)]+)\)\s+Method:\s+(?'method'.*)from host\s+(?'host'[^\s]+)";
        const string pattern2 = @"""(?'key'[^""]+)"":""(\\""(?'value'[^\\]+)|(?'value'[^""]+))";

        public DateTime date { get; set; }
        public int id { get; set; }
        public string description { get; set; }
        public string code { get; set; }
        public string method { get; set; }
        public string host { get; set; }
        public string point { get; set; }
        public string currentVersionNumber { get; set; }
        public string sku { get; set; }

        public Data() { }

        public Data(string line)
        {
            Match match = Regex.Match(line, pattern1);
            date = DateTime.Parse(match.Groups["date"].Value);
            id = int.Parse(match.Groups["id"].Value);
            description = match.Groups["description"].Value.Trim();
            code = match.Groups["code"].Value;
            method = match.Groups["method"].Value;
            host = match.Groups["host"].Value;

            MatchCollection matches = Regex.Matches(line, pattern2);
            foreach (Match m in matches)
            {
                string key = m.Groups["key"].Value;
                string value = m.Groups["value"].Value;
                switch (key)
                {
                    case "Point":
                        point = value;
                        break;
                    case "currentVersionNumber":
                        currentVersionNumber = value;
                        break;
                    case "sku":
                        sku = value;
                        break;
                    default:
                        break;
                }
            }
        }
    }
}
Upvotes: 0
Reputation: 437052
You can improve your PowerShell code's performance as follows:
Use a switch statement with the -Regex option for fast line-by-line processing based on regexes, utilizing only a single regex operation per input line.
Use a System.IO.StreamWriter instance to write to the target file.
Caveat: The solutions below assume that only one sku property value is present on each input line - switch -Regex behaves like the -match operator, which finds at most one match in the input string - see the bottom section for a solution that captures all matches per line.
$sw = [System.IO.StreamWriter]::new("$HOME\Documents\StreamWrite.txt")
switch -Regex -File $file {
    '"sku":"([^"]+)' { $sw.WriteLine($Matches[1]) }
}
$sw.Close()
Note: The regex assumes that the format of the sample line is strictly adhered to. However, if variations in whitespace can occur (e.g., "sku":"87FF" vs. "sku": "87FF"), the regex needs to account for that: '"sku":\s*"([^"]+)'
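To illustrate, a whitespace-tolerant variant of the switch approach could look like this (a sketch assuming the same $file variable and output path as above):

```powershell
# Sketch: same streaming switch -Regex approach, but \s* after the colon
# tolerates both "sku":"87FF" and "sku": "87FF".
$sw = [System.IO.StreamWriter]::new("$HOME\Documents\StreamWrite.txt")
switch -Regex -File $file {
    '"sku":\s*"([^"]+)' { $sw.WriteLine($Matches[1]) }
}
$sw.Close()
```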
The above uses streaming processing (processing one line at a time), which avoids having to read the entire file into memory at once.
If you don't mind reading the whole file at once, you can simplify the command to use a single Set-Content call on the collected-in-memory output lines for writing the output file, but note that my tests show that it won't be faster:
Set-Content $outFile -Value $(
    switch -Regex -File $file {
        '"sku":"([^"]+)' { $Matches[1] }
    }
)
Note: For best performance, the output lines are passed via the -Value parameter to Set-Content at once, as a single array; if you were to use the pipeline instead ($(switch ...) | Set-Content $outFile), the command would be much slower, because the lines would pass through the pipeline one by one.
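The difference can be measured directly. A minimal sketch, assuming $file and $outFile are already defined:

```powershell
# All-at-once: collect every match in memory, then write in one -Value call.
(Measure-Command {
    Set-Content $outFile -Value $(
        switch -Regex -File $file { '"sku":"([^"]+)' { $Matches[1] } }
    )
}).TotalSeconds

# One-by-one: each match travels through the pipeline individually - much slower.
(Measure-Command {
    $(switch -Regex -File $file { '"sku":"([^"]+)' { $Matches[1] } }) |
        Set-Content $outFile
}).TotalSeconds
```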
To capture all values that match the regex on a given line, use the following approach:
$sw = [System.IO.StreamWriter]::new("$HOME\Documents\StreamWrite.txt")
# Create a precompiled, case-sensitive regex.
$re = [regex]::new('(?<="sku":")[^"]+', 'Compiled')
switch -File $file {
    default {
        foreach ($val in $re.Matches($_).Value) {
            $sw.WriteLine($val)
        }
    }
}
$sw.Close()
Upvotes: 2
Reputation: 61013
I guess using switch -Regex -File might be the fastest, but you will have to measure this on your test log(s) yourself.
$result = switch -Regex -File $file {
    '"sku"\s?:\s?"([a-z0-9]+)"' { $Matches[1] }
}
# Double quotes are required here so that $HOME is expanded.
$result | Set-Content -Path "$HOME\Documents\File(switch).txt" -Encoding UTF8
Upvotes: 0
Reputation: 26
You can use Get-ChildItem *.log | Select-String -Pattern; check if it helps with performance.
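For example, a sketch of that approach (the file paths and the sku regex are assumptions based on the question; -AllMatches captures multiple sku occurrences per line):

```powershell
# Sketch: let Select-String scan the log files and emit only the captured sku values.
Get-ChildItem "$HOME\Documents\*.log" |
    Select-String -Pattern '"sku":"([^"]+)"' -AllMatches |
    ForEach-Object { foreach ($m in $_.Matches) { $m.Groups[1].Value } } |
    Set-Content "$HOME\Documents\File(SelectString).txt" -Encoding UTF8
```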
Upvotes: 0