Reputation: 53
Been going crazy all week unable to solve this issue. I have a dictionary word file that will grow to a few million words at some point; for now, let's assume it's just a text file, "Words.txt", which contains:
App
Apple
Application
Bar
Bat
Batter
Cap
Capital
Candy
What I need it to do is to match each string against the rest of the file and only write output of the first hit. This will be alphabetical.
For example, the desired output from the words above would be:
App - due to pattern "App" being seen first and skips "Apple" and "Application"
Bar - due to pattern "Bar", unique
Bat - due to pattern "Bat" being seen first and skips "Batter"
Cap - due to pattern "Cap" being seen first and skips "Capital"
Candy - due to pattern "Candy", unique
What I absolutely cannot figure out is how to ignore matches that happen after the initial hit and move on to a 'new' pattern. It would be OK if the other, redundant patterns are overwritten or just skipped; it doesn't matter how.
I have a script to match patterns, but I don't know how to end up with the desired output :( Any help?!?!
$Words = "C:\Words.txt"
[System.Collections.ArrayList]$WordList = Get-Content $Words
$WordList2 = $WordList
$i = 0
foreach ($item in $WordList)
{
    $r = 0   # restart the inner index for each outer word
    foreach ($item2 in $WordList2)
    {
        # report every word that starts with the current word
        if ($item2 -like "$item*")
        {
            Write-Host ("Match " + [string]$i + " " + $item + " " + [string]$r + " " + $item2)
        }
        $r++
    }
    $i++
}
Upvotes: 1
Views: 100
Reputation: 437218
It's sufficient to process the lines one by one and compare them to the most recent unique prefix:
$prefix = ''   # initialize the prefix pattern
foreach ($line in [IO.File]::ReadLines('C:\Words.txt')) {
    if ($line -like $prefix) { continue }   # same prefix, skip
    $line                                   # output new unique prefix
    $prefix = "$line*"                      # save new prefix pattern
}
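To try the loop without the real dictionary, the following sketch writes the question's sample words to a temporary file and runs the same logic on it. The temporary file name and the $sample and $result variables are introduced here purely for illustration.

```powershell
# Write the question's sample words to a temporary file (illustrative path).
$tempFile = Join-Path ([IO.Path]::GetTempPath()) 'Words-sample.txt'
$sample = 'App', 'Apple', 'Application', 'Bar', 'Bat', 'Batter', 'Cap', 'Capital', 'Candy'
[IO.File]::WriteAllLines($tempFile, [string[]] $sample)

$prefix = ''
$result = foreach ($line in [IO.File]::ReadLines($tempFile)) {
    if ($line -like $prefix) { continue }   # starts with the current prefix: skip
    $line                                   # new unique prefix: output it
    $prefix = "$line*"                      # remember it as the pattern to skip
}

$result   # App, Bar, Bat, Cap, Candy (one per line)
Remove-Item -LiteralPath $tempFile
```

One caveat worth checking against your data: -like treats characters such as [, ? and * in the saved prefix as wildcards, so if the word list can contain them, building the pattern with [WildcardPattern]::Escape($line) + '*' may be the safer choice.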
Note: Since you mention the input file being large, I'm using System.IO.File.ReadLines rather than Get-Content to read the file, for superior performance.
Note: Your sample input path is a full path anyway, but be sure to always pass full paths to .NET methods, because .NET's working directory usually differs from PowerShell's.
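If you do start from a relative path, one way to get a full path is Convert-Path, which resolves it against PowerShell's current location. The sketch below is only an illustration: the temporary folder and the two-word file are made up so that the snippet can run on its own.

```powershell
# Create a scratch folder, make it PowerShell's current location, and put a small file in it.
$dir = Join-Path ([IO.Path]::GetTempPath()) ([IO.Path]::GetRandomFileName())
$null = New-Item -ItemType Directory -Path $dir
Push-Location -LiteralPath $dir
Set-Content -LiteralPath 'Words.txt' -Value 'App', 'Apple'

# Convert-Path resolves the relative path against PowerShell's location,
# so the .NET method receives a full path regardless of .NET's working directory.
$fullPath = Convert-Path -LiteralPath 'Words.txt'
$lines = @([IO.File]::ReadLines($fullPath))   # @() enumerates fully, which also closes the file
$lines

Pop-Location
Remove-Item -LiteralPath $dir -Recurse
```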
If you wrap the foreach loop in & { ... }, you can pipe the result in streaming fashion (line by line, without collecting all results in memory first) to Set-Content.
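A minimal sketch of that wrapping follows, using made-up temporary input and output files so that it can run on its own; substitute your real full paths.

```powershell
# Illustrative temporary files standing in for the real input and output paths.
$inFile  = Join-Path ([IO.Path]::GetTempPath()) 'Words-in.txt'
$outFile = Join-Path ([IO.Path]::GetTempPath()) 'Words-out.txt'
$sample  = 'App', 'Apple', 'Bar', 'Bat', 'Batter'
[IO.File]::WriteAllLines($inFile, [string[]] $sample)

# The script block emits each unique prefix as soon as it is found,
# and Set-Content receives the lines one by one through the pipeline.
& {
    $prefix = ''
    foreach ($line in [IO.File]::ReadLines($inFile)) {
        if ($line -like $prefix) { continue }
        $line
        $prefix = "$line*"
    }
} | Set-Content -LiteralPath $outFile

$written = Get-Content -LiteralPath $outFile   # App, Bar, Bat
$written
Remove-Item -LiteralPath $inFile, $outFile
```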
However, using a .NET type for saving as well will perform much better - see the bottom section of this answer.
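The section referred to is not reproduced here. As one possible approach, and only as an assumption about what such a solution could look like, the sketch below writes the results with System.IO.StreamWriter. The temporary file names are again illustrative, and the ::new() syntax requires PowerShell 5 or later.

```powershell
# Illustrative temporary files standing in for the real input and output paths.
$inFile  = Join-Path ([IO.Path]::GetTempPath()) 'Words-in.txt'
$outFile = Join-Path ([IO.Path]::GetTempPath()) 'Words-out.txt'
$sample  = 'App', 'Apple', 'Bar', 'Bat', 'Batter'
[IO.File]::WriteAllLines($inFile, [string[]] $sample)

$writer = [IO.StreamWriter]::new($outFile)   # pass a full path, as with ReadLines
try {
    $prefix = ''
    foreach ($line in [IO.File]::ReadLines($inFile)) {
        if ($line -like $prefix) { continue }   # same prefix, skip
        $writer.WriteLine($line)                # write the new unique prefix directly
        $prefix = "$line*"
    }
}
finally {
    $writer.Dispose()   # flush and close the file even if an error occurs
}

$saved = [IO.File]::ReadAllLines($outFile)   # App, Bar, Bat
$saved
Remove-Item -LiteralPath $inFile, $outFile
```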
Upvotes: 1