Sasha

Reputation: 4014

PowerShell binary grep

Is there a way to determine whether a specified file contains a specified byte array (at any position) in PowerShell?

Something like:

fgrep --binary-files=binary "$data" "$filepath"

Of course, I can write a naive implementation:

function posOfArrayWithinArray {
    param ([byte[]] $arrayA, [byte[]]$arrayB)
    if ($arrayB.Length -ge $arrayA.Length) {
        foreach ($pos in 0..($arrayB.Length - $arrayA.Length)) {
            if ([System.Linq.Enumerable]::SequenceEqual(
                $arrayA,
                [System.Linq.Enumerable]::Skip($arrayB, $pos).Take($arrayA.Length)
            )) {return $pos}
        }
    }
    -1
}

function posOfArrayWithinFile {
    param ([byte[]] $array, [string]$filepath)
    posOfArrayWithinArray $array (Get-Content $filepath -Raw -AsByteStream)
}

(They return the position or -1, but a plain $true/$false would also be enough for me.)

However, this implementation is extremely slow.

Upvotes: 6

Views: 4633

Answers (4)

iRon

Reputation: 23788

Sorry for the additional answer. It is not usual to post one, but this universal question intrigues me, and the approach and information in my initial "using -Like" answer are completely different. By the way, if you are waiting for a positive response to "I believe that it must exist in .NET" before accepting an answer, that is probably not going to happen; the same quest shows up in Stack Overflow searches in combination with C#, .NET or LINQ.
Anyway, given that nobody has been able to find the assumed single .NET command for this so far, it is quite understandable that several semi-.NET solutions are being proposed instead, but I believe these will cause some undesired overhead for a universal function.
Assume that your ByteArray (the byte array being searched) and SearchArray (the byte array to search for) are completely random. There is only a 1/256 chance that any given byte in the ByteArray matches the first byte of the SearchArray. If it does not match, you don't have to look further; if it does match, the chance that the second byte also matches brings the total to 1/256^2, etc. This means that the inner loop will run only about 1.004 times as often as the outer loop. In other words, the performance of everything outside the inner loop (but inside the outer loop) is almost as important as what is in the inner loop!
Note that this also implies that the chance that a 500 KB random sequence exists in a 100 MB random sequence is virtually zero. (So how random are your binary sequences actually? If they are far from random, I think you need to add some more details to your question.) A worst-case scenario for my assumption would be a ByteArray consisting of identical bytes (e.g. 0, 0, 0, ..., 0, 0, 0) and a SearchArray of the same bytes ending with a different byte (e.g. 0, 0, 0, ..., 0, 0, 1).
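The "about 1.004" figure can be checked with a short calculation. This is only a sketch under the same assumption of uniformly random, independent bytes, with p = 1/256 as the per-byte match probability: the first comparison at each position always runs, the second runs with probability p, the third with probability p^2, and so on.

```latex
\mathbb{E}[\text{comparisons per position}]
  = \sum_{k=0}^{\infty} p^{k}
  = \frac{1}{1 - p}
  = \frac{256}{255}
  \approx 1.0039,
\qquad p = \frac{1}{256}.
```

For a finite SearchArray the sum stops after as many terms as the array has bytes, but the terms shrink so quickly that the result is practically unchanged.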

Based on this, it shows again (I have also shown this in some other answers) that native PowerShell commands aren't that bad and can possibly even outperform .NET/LINQ commands in some cases. In my testing, the Find-Bytes function below is between about 20% faster and twice as fast as the function in your question:

Find-Bytes

Returns the index at which the -Search byte sequence is found in the -Bytes byte sequence. If the search sequence is not found, $Null ([System.Management.Automation.Internal.AutomationNull]::Value) is returned.

Parameters

-Bytes
The byte array to be searched

-Search
The byte array to search for

-Start
Defines where to start searching in the Bytes sequence (default: 0)

-All
By default, only the first index found will be returned. Use the -All switch to return the remaining indexes of any other search sequences found.

Function Find-Bytes([byte[]]$Bytes, [byte[]]$Search, [int]$Start, [Switch]$All) {
    For ($Index = $Start; $Index -le $Bytes.Length - $Search.Length ; $Index++) {
        For ($i = 0; $i -lt $Search.Length -and $Bytes[$Index + $i] -eq $Search[$i]; $i++) {}
        If ($i -ge $Search.Length) { 
            $Index
            If (!$All) { Return }
        } 
    }
}

Usage example:

$a = [byte[]]("the quick brown fox jumps over the lazy dog".ToCharArray())
$b = [byte[]]("the".ToCharArray())

Find-Bytes -all $a $b
0
31
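To connect this back to the original question (does a file contain the bytes?), the same nested loop could be wrapped in a small file helper. This is a minimal sketch, not part of the answer above: the name Test-FileContainsBytes and the sample bytes are hypothetical, and it assumes the file fits in memory and that $Path is a full path (.NET resolves relative paths against its own current directory, which may differ from PowerShell's location).

```powershell
# Hypothetical helper: $true if the file contains the byte sequence, else $false.
function Test-FileContainsBytes {
    param([string]$Path, [byte[]]$Search)
    # Reads the whole file into memory (assumption: the file is small enough).
    $bytes = [System.IO.File]::ReadAllBytes($Path)
    for ($index = 0; $index -le $bytes.Length - $Search.Length; $index++) {
        # Advance $i while the bytes keep matching; the loop body is intentionally empty.
        for ($i = 0; $i -lt $Search.Length -and $bytes[$index + $i] -eq $Search[$i]; $i++) {}
        if ($i -ge $Search.Length) { return $true }
    }
    return $false
}

# Demo with a throwaway temp file holding the made-up bytes 0..5.
$tmp = [System.IO.Path]::GetTempFileName()
[System.IO.File]::WriteAllBytes($tmp, [byte[]](0, 1, 2, 3, 4, 5))
$found   = Test-FileContainsBytes -Path $tmp -Search ([byte[]](2, 3, 4))   # True
$missing = Test-FileContainsBytes -Path $tmp -Search ([byte[]](4, 3))      # False
Remove-Item $tmp
$found
$missing
```

Returning as soon as the first match is found keeps the helper in line with the "simple $true/$false" requirement from the question.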

Benchmark
Note that you should open a new PowerShell session to benchmark this properly, as LINQ uses a large cache that probably doesn't apply to your use case.

$a = [byte[]](&{ foreach ($i in (0..500Kb)) { Get-Random -Maximum 256 } })
$b = [byte[]](&{ foreach ($i in (0..500))   { Get-Random -Maximum 256 } })

Measure-Command {
    $y = Find-Bytes $a $b
}

Measure-Command {
    $x = posOfArrayWithinArray $b $a
}

Upvotes: 5

iRon

Reputation: 23788

Just formalizing my comments and agreeing with your comment:

I dislike the idea of converting byte sequences to character sequences at all (I'd better have functionality to match byte (or other) sequences as they are), among the conversion-to-character-strings-implying solutions this seems to be one of the quickest

Performance

String manipulations are usually expensive, but re-initializing a LINQ call is apparently pretty expensive as well. I guess you may presume that the native algorithms behind PowerShell's string representation and operators like -Like have by now been thoroughly optimized.

Memory

Aside from the performance disadvantages mentioned above, converting each byte to a decimal string representation has a memory disadvantage as well. In the proposed solution, each byte takes an average of 2.57 characters (depending on the number of decimal digits of each byte: (1 * 10 / 256) + (2 * 90 / 256) + (3 * 156 / 256)). In addition, you need an extra character to separate the numeric representations. In total, this makes the sequence about 3.57 times longer.
You might consider saving space by e.g. converting to hexadecimal and/or combining the separator, but that will likely result in an expensive conversion again.
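The 2.57 average can be reproduced directly. This is just a quick check of the arithmetic, counting the decimal digits of every possible byte value; the variable names are arbitrary.

```powershell
# Number of decimal digits for each possible byte value 0..255.
$digits = 0..255 | ForEach-Object { ([string]$_).Length }

# Average digits per byte: (1*10 + 2*90 + 3*156) / 256
$average = ($digits | Measure-Object -Average).Average
$average        # 2.5703125

# Add one separator character per byte.
$average + 1    # 3.5703125
```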

Easy

Anyways, the easy way is probably still the most effective.
This comes down to the following simplified syntax:

" $Sequence " -Like "* $SubSequence *" # $True if $Sequence contains $SubSequence

(Where $Sequence and $SubSequence are binary arrays of type: [Byte[]])

Note 1: the spaces around the variables are important. This will prevent a false positive in case a 1 (or 2) digit byte representation overlaps with a 2 (or 3) digit byte representation. E.g.: 123 59 74 contains 23 59 7 in the string representation but not in the actual bytes.

Note 2: This syntax will only tell you whether $Sequence contains $SubSequence ($True or $False). It gives no clue as to where $SubSequence actually resides in $Sequence. If you need to know this, or e.g. want to replace $SubSequence with something else, refer to this answer: Methods to hex edit binary files via PowerShell.
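The overlap case from Note 1 can be tried out with the example values given there. This is a small sketch that assumes $OFS is left at its default, so that an array expands to its elements separated by single spaces.

```powershell
[byte[]]$Sequence    = 123, 59, 74
[byte[]]$SubSequence = 23, 59, 7

# Without the surrounding spaces, "23 59 7" is found inside "123 59 74".
$withoutSpaces = "$Sequence" -like "*$SubSequence*"        # True (false positive)

# With the surrounding spaces, every number must match as a whole.
$withSpaces = " $Sequence " -like "* $SubSequence *"       # False

$withoutSpaces
$withSpaces
```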

Upvotes: 1

Sasha

Reputation: 4014

I've determined that the following can work as a workaround:

(Get-Content $filepath -Raw -Encoding 28591).IndexOf($fragment)

That is, any bytes can be successfully matched by PowerShell strings (in fact, .NET System.Strings) when we specify a binary-safe encoding. Of course, we need to use the same encoding for both the file and the fragment, and the encoding must be truly binary-safe (e.g. 1250, 1000 and 28591 fit, but the various flavors of Unicode, including the default BOM-less UTF-8, don't, because they convert any ill-formed code unit to the same replacement character, U+FFFD). Thanks to Theo for the clarification.

On older PowerShell, you can use:

[System.Text.Encoding]::GetEncoding(28591).
    GetString([System.IO.File]::ReadAllBytes($filepath)).
    IndexOf($fragment)
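One caveat that may be worth keeping in mind: String.IndexOf(string) performs a culture-sensitive comparison, while the overload that takes [System.StringComparison]::Ordinal compares raw character values, which seems closer to what a byte-for-byte match needs. The following is a minimal sketch on in-memory data; the byte values are made up for illustration, and both sides are decoded with the same code page 28591.

```powershell
$latin1 = [System.Text.Encoding]::GetEncoding(28591)

# Made-up sample data: a "file" and a fragment that starts at byte offset 3.
[byte[]]$fileBytes     = 0, 255, 13, 10, 200, 0, 65
[byte[]]$fragmentBytes = 10, 200, 0

$text     = $latin1.GetString($fileBytes)
$fragment = $latin1.GetString($fragmentBytes)

# The ordinal comparison matches character values one by one, ignoring culture rules.
$position = $text.IndexOf($fragment, [System.StringComparison]::Ordinal)
$position   # 3
```

Because each byte maps to exactly one character in this encoding, the returned character index is also the byte offset.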

Sadly, I haven't found a way to match sequences universally (i.e. a common method to match sequences with any item type: integer, object, etc.). I believe that it must exist in .NET (especially since a particular implementation for sequences of characters exists). Hopefully, someone will suggest it.

Upvotes: 2

Theo

Reputation: 61188

The below code may prove to be faster, but you will have to test that out on your binary files:

function Get-BinaryText {
    # converts the bytes of a file to a string that has a
    # 1-to-1 mapping back to the file's original bytes. 
    # Useful for performing binary regular expressions.
    Param (
        [Parameter(Mandatory = $true, ValueFromPipeline = $true, ValueFromPipelineByPropertyName = $true)]
        [ValidateScript( { Test-Path $_ -PathType Leaf } )]
        [Alias('FullName','FilePath')]
        [string]$Path
    )

    $Stream = New-Object System.IO.FileStream -ArgumentList $Path, 'Open', 'Read'

    # Note: Codepage 28591 returns a 1-to-1 char to byte mapping
    $Encoding     = [Text.Encoding]::GetEncoding(28591)
    $StreamReader = New-Object System.IO.StreamReader -ArgumentList $Stream, $Encoding
    $BinaryText   = $StreamReader.ReadToEnd()

    $Stream.Dispose()
    $StreamReader.Dispose()

    return $BinaryText
}

# enter the byte array to search for here
# for demo, I'll use 'SearchMe' in bytes
[byte[]]$searchArray = 83,101,97,114,99,104,77,101

# create a regex from the $searchArray bytes
# 'SearchMe' --> '\x53\x65\x61\x72\x63\x68\x4D\x65'
$searchString = ($searchArray | ForEach-Object { '\x{0:X2}' -f $_ }) -join ''
$regex = [regex]$searchString

# read the file as binary string
$binString = Get-BinaryText -Path 'D:\test.bin'

# use regex to return the 0-based starting position of the search string
# return -1 if not found
$found = $regex.Match($binString)
if ($found.Success) { $found.Index } else { -1 }
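If every occurrence is needed rather than just the first, the same pattern can be passed to Matches instead of Match. This is a sketch on in-memory bytes rather than a file; the sample data is made up, and it reuses the code page 28591 mapping and the \xHH pattern construction from the code above.

```powershell
$latin1 = [System.Text.Encoding]::GetEncoding(28591)

# Made-up sample data: the two-byte sequence 83,101 occurs at offsets 1 and 4.
[byte[]]$data        = 0, 83, 101, 255, 83, 101
[byte[]]$searchArray = 83, 101

# Build the pattern '\x53\x65' and decode the data to a 1-to-1 binary string.
$pattern   = ($searchArray | ForEach-Object { '\x{0:X2}' -f $_ }) -join ''
$binString = $latin1.GetString($data)

# Collect the 0-based start index of every match.
$positions = @(([regex]$pattern).Matches($binString) | ForEach-Object { $_.Index })
$positions   # 1, 4
```

Note that Matches reports non-overlapping occurrences only, so a sequence that overlaps with a previous match would not be listed separately.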

Upvotes: 2
