Reputation: 23
I need help with a regular expression. I have thousands of lines in a file with the following format:
+ + [COMPILED]\SRC\FileCheck.cs - TotalLine: 99 RealLine: 27 Braces: 18 Comment: 49 Empty: 5
+ + [COMPILED]\SRC\FindstringinFile.cpp - TotalLine: 103 RealLine: 26 Braces: 22 Comment: 50 Empty: 5
+ + [COMPILED]\SRC\findingstring.js - TotalLine: 91 RealLine: 22 Braces: 14 Comment: 48 Empty: 7
+ + [COMPILED]\SRC\restinpeace.h - TotalLine: 95 RealLine: 24 Braces: 16 Comment: 48 Empty: 7
+ + [COMPILED]\SRC\Getsomething.h++ - TotalLine: 168 RealLine: 62 Braces: 34 Comment: 51 Empty: 21
+ + [COMPILED]\SRC\MemDataStream.hh - TotalLine: 336 RealLine: 131 Braces: 82 Comment: 72 Empty: 51
+ + [CONTEXT]\SRC\MemDataStream.sql - TotalLine: 36 RealLine: 138 Braces: 80 Comment: 76 Empty: 59
I need a regular expression that can give me:
I'm using PowerShell to implement this and have been successful in getting the results back using the Get-Content (to read the file) and Select-String cmdlets. The problem is that it's taking a long time to extract the various substrings and then write them to the XML file. (I have not included the code for generating the XML.) I've never used regular expressions before, but I understand that a regular expression would be an efficient way to get the strings.
Help would be appreciated.
The Select-String cmdlet accepts a regular expression as the search pattern.
Current code is as follows:
function Get-SubString
{
    Param ([string]$StringtoSearch, [string]$StartOfTheString, [string]$EndOfTheString)
    # The bodies of the two guard branches are missing from the post; returning $null is an assumption.
    If ($StringtoSearch.IndexOf($StartOfTheString) -eq -1) { Return $null }
    [int]$StartOfIndex = $StringtoSearch.IndexOf($StartOfTheString) + $StartOfTheString.Length
    [int]$EndOfIndex = $StringtoSearch.IndexOf($EndOfTheString, $StartOfIndex)
    # Start marker found but no end marker after it
    If ($EndOfIndex -eq -1) { Return $null }
    [string]$ExtractedString = $StringtoSearch.Substring($StartOfIndex, $EndOfIndex - $StartOfIndex)
    Return $ExtractedString
}
function Get-FileExtension
{
    Param ([string]$Path)
    # (function body not included in the post)
}
#For each file extension we will be searching all lines starting with + +
$SearchIndividualLines = "+ + ["
$TotalLines = Select-String -Pattern $SearchIndividualLines -Path $StandardOutputFilePath -AllMatches -SimpleMatch
for ($i = $TotalLines.GetLowerBound(0); $i -le $TotalLines.GetUpperBound(0); $i++)
{
    $FileDetailsString = $TotalLines[$i]
    #Get File Path
    $StartStringForFilePath = "]"
    $EndStringforFilePath = "- TotalLine"
    $FilePathValue = Get-SubString -StringtoSearch $FileDetailsString -StartOfTheString $StartStringForFilePath -EndOfTheString $EndStringforFilePath
    #Write-Host FilePathValue is $FilePathValue
    $FileExtensionValue = Get-FileExtension -Path $FilePathValue
    #Write-Host FileExtensionValue is $FileExtensionValue
    $StartStringForRealLine = "RealLine:"
    $EndStringforRealLine = "Braces"
    $RealLineValue = Get-SubString -StringtoSearch $FileDetailsString -StartOfTheString $StartStringForRealLine -EndOfTheString $EndStringforRealLine
}
Upvotes: 2
Views: 546
Reputation: 6605
Something like this?
PS> (get-content C:\temp\sample.txt) | % { if ($_ -match '.*COMPILED\](\\.*)(\.\w+)\s*.*RealLine:\s*(\d+).*') { [PSCustomObject]@{FilePath=$matches[1]; Extention=$Matches[2]; RealLine=$matches[3]} } }
FilePath              Extention RealLine
--------              --------- --------
\SRC\FileCheck        .cs       27
\SRC\FindstringinFile .cpp      26
\SRC\findingstring    .js       22
\SRC\restinpeace      .h        24
\SRC\Getsomething     .h        62
\SRC\MemDataStream    .hh       131
Update: Whatever is inside parentheses is captured, so if you want to capture [COMPILED] as well, you just need to move that part inside the parentheses in the regex:
Instead of
$_ -match '.*COMPILED\](\\.*)
use
$_ -match '.*(\[COMPILED\]\\.*)
The link in the comment to your question includes a good primer on the regex.
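To make the effect of moving the parenthesis concrete, here is a minimal sketch that runs both patterns against one sample line from the question and reads the groups back from the automatic $Matches variable. The variable names ($line, $withoutPrefix, $withPrefix and so on) are made up for illustration.

```powershell
# One sample line copied from the question.
$line = '+ + [COMPILED]\SRC\FileCheck.cs - TotalLine: 99 RealLine: 27 Braces: 18 Comment: 49 Empty: 5'

# Prefix outside the parentheses: group 1 starts at the first backslash.
$null = $line -match '.*COMPILED\](\\.*)(\.\w+)\s*.*RealLine:\s*(\d+).*'
$withoutPrefix = $Matches[1]

# Prefix inside the parentheses: group 1 now includes [COMPILED].
$null = $line -match '.*(\[COMPILED\]\\.*)(\.\w+)\s*.*RealLine:\s*(\d+).*'
$withPrefix = $Matches[1]
$extension  = $Matches[2]
$realLine   = [int]$Matches[3]

"{0} | {1} | {2} | {3}" -f $withoutPrefix, $withPrefix, $extension, $realLine
```

Each -match call overwrites $Matches, which is why the values are copied into separate variables straight after each call.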
UPDATE 2: Now that you want to capture the full path, I am guessing your sample looks like this:
+ + [COMPILED]C:\project\Rom\Main\Plan\file1.file2.file3\Cmd\Camera.culture.less-Late-PP.min.js - TotalLine: 336 RealLine: 131 Braces: 82 Comment: 72 Empty: 51
The technique above will still work; you just need a slight adjustment to the first parenthesis, like this:
$_ -match '(\[COMPILED\].*)
This tells the regex to capture [COMPILED] and everything that comes after it, up to the extension, which is matched as a dot followed by one or more word characters. Since \w covers letters, digits and underscores, an extension like .3gp is matched, but one containing other characters, such as .h++, is captured only as .h.
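A quick way to check how the extension group behaves is to try it on a couple of lines. The sketch below uses the .h++ line from the question plus a made-up .3gp line (hypothetical, not from the original data). The second pattern, $anyExt, is my own variation rather than part of the answer: it takes the extension as every non-space character up to the " - TotalLine:" separator.

```powershell
# The .h++ line is from the question; the .3gp line is a hypothetical sample.
$hLine  = '+ + [COMPILED]\SRC\Getsomething.h++ - TotalLine: 168 RealLine: 62 Braces: 34 Comment: 51 Empty: 21'
$gpLine = '+ + [COMPILED]\SRC\clip.3gp - TotalLine: 10 RealLine: 4 Braces: 2 Comment: 3 Empty: 1'

# Extension as a dot followed by word characters (letters, digits, underscore).
$wordExt = '(\[COMPILED\].*)(\.\w+)\s*.*RealLine:\s*(\d+)'
$null = $gpLine -match $wordExt
$gpExt = $Matches[2]        # digits are word characters, so the whole extension is captured
$null = $hLine -match $wordExt
$hExt = $Matches[2]         # the + signs are not word characters, so the capture stops early

# Assumed alternative: extension is everything non-space before " - TotalLine:".
$anyExt = '(\[COMPILED\].*)(\.\S+)\s+-\s+TotalLine:.*RealLine:\s*(\d+)'
$null = $hLine -match $anyExt
$hFullExt = $Matches[2]

"{0} | {1} | {2}" -f $gpExt, $hExt, $hFullExt
```

The trade-off with the second pattern is that it relies on the " - TotalLine:" separator always being present in exactly that form.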
So, your original one liner would instead be:
(get-content C:\temp\sample.txt) | % { if ($_ -match '.(\[COMPILED\].*)(\.\w+)\s*.*RealLine:\s*(\d+).*') { [PSCustomObject]@{FilePath=$matches[1]; Extention=$Matches[2]; RealLine=$matches[3]} } }
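Since Select-String accepts a regular expression itself, the same pattern can also be handed to it directly, so reading and matching happen in one step. The following is only a sketch under a few assumptions: the temp file name sample-metrics.txt, the named groups (FilePath, Extension, RealLine) and the $rows variable are all invented for illustration, and I have not measured whether this is faster on a large file.

```powershell
# Hypothetical temp file holding two of the sample lines from the question.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'sample-metrics.txt'
@(
    '+ + [COMPILED]\SRC\FileCheck.cs - TotalLine: 99 RealLine: 27 Braces: 18 Comment: 49 Empty: 5'
    '+ + [CONTEXT]\SRC\MemDataStream.sql - TotalLine: 36 RealLine: 138 Braces: 80 Comment: 76 Empty: 59'
) | Set-Content -Path $path

# Same pattern as the one-liner, with named groups instead of numbered ones.
$pattern = '(?<FilePath>\[COMPILED\].*)(?<Extension>\.\w+)\s*.*RealLine:\s*(?<RealLine>\d+)'

# Select-String reads the file and applies the regex; only matching lines come back.
$rows = @(Select-String -Path $path -Pattern $pattern | ForEach-Object {
    $groups = $_.Matches[0].Groups
    [PSCustomObject]@{
        FilePath  = $groups['FilePath'].Value
        Extension = $groups['Extension'].Value
        RealLine  = [int]$groups['RealLine'].Value
    }
})

$rows | Format-Table -AutoSize

# Clean up the hypothetical sample file.
Remove-Item -Path $path
```

The [CONTEXT] line is skipped because the pattern requires [COMPILED]; the resulting objects can then be passed on to whatever writes the XML.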
Upvotes: 2