Fábio Linhares

Reputation: 355

Efficiently search for a string in large files

How can I check whether a string exists in a file, given that:

  1. it is a single text file;
  2. its size is up to 10 GB;
  3. the whole file is only one line;
  4. the file contains only random digits from 1 to 9;
  5. I want to use PowerShell (because I think it will be more efficient, although I don't know how to program in this language)?

I have tried this in batch:

FINDSTR "897516" decimal_output.txt
pause

But, as I said, I need the fastest and most efficient way to do this.


I also tried this code, which I found on Stack Overflow:

$SEL = Select-String -Path C:\Users\fabio\Desktop\CONVERTIDOS\dec_output.txt -Pattern "123456"

# Select-String returns nothing ($null) when there is no match.
if ($null -ne $SEL)
{
    echo "Contains String"
}
else
{
    echo "Not Contains String"
}

But I get the error below, and I don't know whether this code is the most robust or appropriate approach. The error:

Select-String : Exception of type 'System.OutOfMemoryException' was thrown.
At C:\Users\fabio\Desktop\1.ps1:1 char:8
+ $SEL = Select-String -Path C:\Users\fabio\Desktop\CONVERTIDOS\dec_out ...
+        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [Select-String], OutOfMemoryException
    + FullyQualifiedErrorId : System.OutOfMemoryException,Microsoft.PowerShell.Commands.SelectStringCommand

Upvotes: 4

Views: 3532

Answers (1)

FatalBulletHit

Reputation: 842

This should do the job:

#################################################################################################################
#
# Searches for a user defined string in the $input_file and counts matches. Works with files of any size.
#
# Adjust source directory and input file name.
#
$source = "C:\adjust\path"
$input_file = "file_name.extension"
#
#
# Define the string you want to search for. Keep quotation marks even if you only search for numbers (otherwise 
# $pattern.Length will be 1 and this script will no longer work with files larger than the $split_size)!
#
$pattern = "Enter the string to search for in here"
#
#
# Using Get-Content on an input file with a size of 1GB or more will cause System.OutOfMemoryExceptions,
# therefore a large file gets temporarily split up.
#
$split_size = 100MB
#
#
# Thanks @Bob (https://superuser.com/a/1295082/868077)
#################################################################################################################

Set-Location $source


if (test-path ".\_split") {

    # Reset the answer so a value left over from an earlier run in the same session is not reused.
    $overwrite = $null

    while ($overwrite -ne "true" -and $overwrite -ne "false") {

        "`n"
        $overwrite = Read-Host ' Split files already/still exist! Delete and overwrite?'

        # Anchor the match to the first letter, so that e.g. "cancel" is not read as "n".
        if ($overwrite -match "^y") {

            $overwrite = "true"
            Remove-Item .\_split -force -recurse
            $a = "`n Deleted existing split files!"

        } elseif ($overwrite -match "^n") {

            $overwrite = "false"
            $a = "`n Continuing with existing split files!"

        } elseif ($overwrite -match "^c") {

            exit

        } else {

            Write-Host "`n Error: Invalid input!`n Type 'y' for 'yes'. Type 'n' for 'no'. Type 'c' for 'cancel'. `n`n`n"

        }

    }

}

Clear-Host


if ((Get-Item $input_file).Length -gt $split_size) {

    # Reset the answer so a value left over from an earlier run in the same session is not reused.
    $delete = $null

    while ($delete -ne "true" -and $delete -ne "false") {

        "`n"
        $delete = Read-Host ' Delete split files afterwards?'

        # Anchor the match to the first letter, so that e.g. "cancel" is not read as "n".
        if ($delete -match "^y") {

            $delete = "true"
            $b = "`n Split files will be deleted afterwards!"

        } elseif ($delete -match "^n") {

            $delete = "false"
            $b = "`n Split files will not be deleted afterwards!"

        } elseif ($delete -match "^c") {

            exit

        } else {

            Write-Host "`n Error: Invalid input!`n Type 'y' for 'yes'. Type 'n' for 'no'. Type 'c' for 'cancel'. `n`n`n"

        }

    }

    Clear-Host

    $a
    $b


    Write-Host "`n This may take some time!"

    if ($overwrite -ne "false") {

        New-Item -ItemType directory -Path ".\_split" >$null 2>&1
        [Environment]::CurrentDirectory = Get-Location

        $bytes = New-Object byte[] 4096
        $in_file = [System.IO.File]::OpenRead($input_file)
        $file_count = 0
        $finished = $false

        while (!$finished) {

            $file_count++
            $bytes_to_read = $split_size
            $out_file = New-Object System.IO.FileStream ".\_split\_split_$file_count.splt",CreateNew,Write,None

            while ($bytes_to_read) {

                $bytes_read = $in_file.Read($bytes, 0, [Math]::Min($bytes.Length, $bytes_to_read))

                if (!$bytes_read) {

                    $finished = $true
                    break

                }

                $bytes_to_read -= $bytes_read
                $out_file.Write($bytes, 0, $bytes_read)

            }

            $out_file.Dispose()

        }

        $in_file.Dispose()

    }

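    # Start from a known state; these variables may survive from an earlier run in the same session.
    $i = 1
    $match_count = 0
    $prev_file = $null

    # Escape the pattern so it is matched literally rather than interpreted as a regular expression.
    $regex = [regex]::Escape($pattern)

    while (Test-Path ".\_split\_split_$i.splt") {

        # -Raw returns the whole chunk as a single string (requires PowerShell 3.0 or later).
        $cur_file = Get-Content ".\_split\_split_$i.splt" -Raw
        $temp_count = ([regex]::Matches($cur_file, $regex)).Count
        $match_count += $temp_count

        $n = $i - 1

        if (Test-Path ".\_split\_split_$n.splt") {

            # Search the seam between two chunks: the last ($pattern.Length - 1) characters of the
            # previous chunk plus the first ($pattern.Length - 1) characters of the current one.
            if ($cur_file.Length -ge $pattern.Length) {

                $file_transition = $prev_file.Substring($prev_file.Length - ($pattern.Length - 1)) + $cur_file.Substring(0,($pattern.Length - 1))

            } else {

                $file_transition = $prev_file.Substring($prev_file.Length - ($pattern.Length - 1)) + $cur_file

            }

            $temp_count = ([regex]::Matches($file_transition, $regex)).Count
            $match_count += $temp_count

        }

        $prev_file = $cur_file
        $i++

    }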

} else {

    $a
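    # The file is small enough to load at once: search its contents, not its name.
    # -Raw requires PowerShell 3.0 or later; the pattern is escaped so it is matched literally.
    $match_count = ([regex]::Matches((Get-Content $input_file -Raw), [regex]::Escape($pattern))).Count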

}


if ($delete -eq "true") {

    Remove-Item ".\_split" -Force -Recurse

}


if ($match_count -ge 1) {

    Write-Host "`n`n String '$pattern' found:`n`n $match_count matches!"

} else {

    Write-Host "`n`n String '$pattern' not found!"

}


Write-Host `n`n`n`n`n

Pause

This splits a large file into multiple smaller files, searches each of them for $pattern, and counts the matches. Matches that straddle the boundary between two consecutive split files are counted as well.

It also lets you delete or keep the split files afterwards, so you can reuse them and don't have to split the large file every time you run this script.
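
If you only need a yes/no answer, the same boundary idea can probably be applied without writing any temporary files: read the file in chunks and carry the last ($Pattern.Length - 1) characters of each chunk over to the next one. The following is a minimal sketch under that assumption, not a tuned implementation. The function name Test-StringInFile and the -ChunkChars parameter are hypothetical names chosen for illustration, and I have not timed it on a 10 GB file.

```powershell
# Sketch only: Test-StringInFile and -ChunkChars are illustrative names, not part of the script above.
function Test-StringInFile {
    param(
        [Parameter(Mandatory = $true)] [string] $Path,
        [Parameter(Mandatory = $true)] [string] $Pattern,
        [int] $ChunkChars = 4MB          # characters read per iteration
    )

    $overlap = $Pattern.Length - 1       # longest partial match a chunk can end with
    $buffer  = New-Object char[] $ChunkChars
    $carry   = ''

    # .NET does not follow PowerShell's current location, so resolve the full path first.
    $reader = New-Object System.IO.StreamReader((Resolve-Path -LiteralPath $Path).ProviderPath)

    try {
        while (($read = $reader.Read($buffer, 0, $buffer.Length)) -gt 0) {

            # Prepend the tail of the previous chunk so a match across the boundary is seen.
            $window = $carry + (New-Object string ($buffer, 0, $read))

            if ($window.IndexOf($Pattern, [StringComparison]::Ordinal) -ge 0) {
                return $true             # stop at the first hit
            }

            # Keep at most $overlap trailing characters for the next iteration.
            $keep  = [Math]::Min($overlap, $window.Length)
            $carry = $window.Substring($window.Length - $keep)
        }

        return $false
    }
    finally {
        $reader.Dispose()
    }
}
```

With the path from the question it could be called like this:

Test-StringInFile -Path 'C:\Users\fabio\Desktop\CONVERTIDOS\dec_output.txt' -Pattern '123456'

Only one chunk plus the small carry-over is held in memory at a time, which is why this should avoid the OutOfMemoryException, and it returns as soon as the first match is found. It uses a plain ordinal string search, so the pattern is always matched literally.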

Upvotes: 4
