Spodg

Reputation: 53

My loop within a loop is painfully slow

I have an object $Posts which contains a title and a SimTitles field, amongst a few others. I need to compare each title to the other titles and give it a similarity score in the SimTitles field. So if I have 80 $Posts, it will need to cover 6,400 iterations (80 × 80), as each title needs to be scored against all the others.

Apart from the Measure-TitleSimilarity routine, which I believe is optimized, can anyone see a way to improve the speed of this double loop that I am missing?

Edit: I have included the function Measure-TitleSimilarity. I am actually passing the array to the function. The whole topic of quantifying arrays for likeness is fascinating. I have tried with Title.ToCharArray(), which changes the magic number to a much higher value. It can also produce a match between two completely different titles as long as the characters are the same (e.g. 'Mother Teresa' would closely match 'Earthmovers' or 'Thermometer', yet they clearly don't share a meaning). Cosine similarity is just one method, but it seemed the easiest to process. @mclayton and @bryancook - I see the light with your suggestion, but I can't grasp how to track what no longer needs to be looked at for similar words.

Function Get-SimTitles([psobject]$NewPosts) {

  $CkTitles = $NewPosts.title

  foreach ($Ck in $CkTitles) {
    $NewPosts | & {
      process {
        if ((Measure-TitleSimilarity $Ck.Split(' ') $_.title.Split(' ')) -gt .2) {
          $_.SimTitles = $_.SimTitles + 1
        }
      }
    }
  }
}

Function Measure-TitleSimilarity
{
    ## Based on VectorSimilarity by .AUTHOR Lee Holmes
    ## Modified slightly to match use

    [CmdletBinding()]
    param(
        [Parameter(Position = 0)]
        $Title1,

        [Parameter(Position = 1)]
        $Title2
    )

    $allkeys = @($Title1) + @($Title2) | Sort-Object -Unique

    # build a presence (0/1) vector for each title, keyed by word
    $set1Hash = @{}
    $set2Hash = @{}
    $setsToProcess = @($Title1, $set1Hash), @($Title2, $set2Hash)

    foreach ($set in $setsToProcess)
    {
        $set[0] | ForEach-Object {
            $set[1][$_] = 1
        }
    }

    # cosine similarity: dot product divided by the product of the magnitudes
    $dot = 0
    $mag1 = 0
    $mag2 = 0

    foreach ($key in $allkeys)
    {
        $dot  += $set1Hash[$key] * $set2Hash[$key]
        $mag1 += ($set1Hash[$key] * $set1Hash[$key])
        $mag2 += ($set2Hash[$key] * $set2Hash[$key])
    }

    $mag1 = [Math]::Sqrt($mag1)
    $mag2 = [Math]::Sqrt($mag2)

    return [Math]::Round($dot / ($mag1 * $mag2), 3)
}
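
As a quick sanity check of the word-level vs character-level behaviour described in the edit above (approximate outputs worked out by hand in the comments; input is lower-cased for the character version to keep comparisons simple):

# word-level: no shared words, so the score is 0
Measure-TitleSimilarity ('Mother Teresa'.Split(' ')) ('Thermometer'.Split(' '))

# character-level: the letters t, h, e, r, m, o dominate both strings, so the
# score comes out around 0.816 even though the titles are semantically unrelated
Measure-TitleSimilarity ('mother teresa'.ToCharArray()) ('thermometer'.ToCharArray())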

Upvotes: 1

Views: 409

Answers (3)

Spodg

Reputation: 53

I tried other ways to measure the similarity of titles, including word frequency vs the frequency of words in a specific title. I think comparing the titles is a topic for a separate post, though. I still like the idea of only looping through once.

@MikeSh - based on your answer, this is what I came up with.

Function Get-SimTitles([psobject]$NewPosts) {

  $end = $NewPosts.Count - 1

  For ($i = 0; $i -lt $end; $i++) {

    # only look at indexes after $i, so each pair is compared once
    $k = $i + 1
    $k..$end |
      Where-Object { $NewPosts[$i].source -ne $NewPosts[$_].source } |
      Where-Object { (Measure-TitleSimilarity $NewPosts[$i].title.split(' ') $NewPosts[$_].title.split(' ')) -gt .35 } |
      & { process { $NewPosts[$_].SimTitles += 1; $NewPosts[$i].SimTitles += 1 } }
  }
}
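
A quick way to exercise this (hypothetical sample posts; the source field mirrors the -ne check above, which skips pairs that come from the same source):

$posts = @(
  [pscustomobject]@{ title = 'PowerShell loop performance tips';    source = 'blogA'; SimTitles = 0 }
  [pscustomobject]@{ title = 'Tips on PowerShell loop performance'; source = 'blogB'; SimTitles = 0 }
  [pscustomobject]@{ title = 'Baking sourdough bread at home';      source = 'blogA'; SimTitles = 0 }
)

Get-SimTitles $posts

# the first two posts should each end up with SimTitles = 1; the third stays 0
$posts | Format-Table title, SimTitles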

Upvotes: 0

mclayton

Reputation: 10125

Partial answer

This includes a couple of the suggestions from the comments:

  • Mathias R. Jessen - "You don't have to compare every title to every title - instead, you only need to compare all unique pairs"

  • my comment - "you could split your titles into word arrays once before you start comparing, and then loop over those, rather than splitting them every time"

$ErrorActionPreference = "Stop";
Set-StrictMode -Version "Latest";

function ConvertTo-WordSets( [psobject] $Posts )
{

    # preprocess each post to break its title into word counts 
    # so we don't need to do it every time we compare 2 posts

    foreach( $post in $Posts )
    {
        $set = new-object PSCustomObject -Property ([ordered] @{
            "Post"   = $post
            "Title"  = $post.Title.Trim()
            "Words"  = $null
            "Counts" = $null
        });
        $set.Words  = $set.Title.Split(" ");
        $set.Counts = $set.Words `
            | group-object `
            | foreach-object `
                -Begin   { $counts = @{} } `
                -Process { $counts.Add($_.Name, $_.Count) } `
                -End     { $counts };
        write-output $set;
    }

}

function Get-SimTitles( [psobject] $NewPosts )
{

    # instead of comparing every object to every object, just compare unique combinations
    # e.g. X compared to Y is the same as Y compared to X so score them both at the same time
    # (and we don't need to compare an object to itself either)

    for( $i = 0; $i -lt $NewPosts.Length; $i++ )
    {
        $left = $NewPosts[$i];
        for( $j = $i + 1; $j -lt $NewPosts.Length; $j++ )
        {
            $right = $NewPosts[$j];
            if ((Measure-TitleSimilarity $left $right) -gt .5)
            {
                $left.Post.SimTitles  = $left.Post.SimTitles + 1;
                $right.Post.SimTitles = $right.Post.SimTitles + 1;
            } 
        } 
    }

}

Function Measure-TitleSimilarity
{
    param
    (
        [Parameter(Position = 0)]
        $Left,
        [Parameter(Position = 1)]   
        $Right
    ) 

    # we can use the pre-processed word counts now

    $allkeys = $Left.Words + $Right.Words | Sort-Object -Unique

    $dot = 0
    $mag1 = 0
    $mag2 = 0

    foreach($key in $allkeys)
    {
        $dot  += $Left.Counts[$key] * $Right.Counts[$key]
        $mag1 += $Left.Counts[$key] * $Left.Counts[$key]
        $mag2 += $Right.Counts[$key] * $Right.Counts[$key]
    }

    $mag1 = [Math]::Sqrt($mag1)
    $mag2 = [Math]::Sqrt($mag2)

    return [Math]::Round($dot / ($mag1 * $mag2), 3)

}

Performance

Neither this nor the original is particularly fast for even moderately sized samples, but this one is about 4 times faster.

# get some test data
$sentences = (Invoke-WebRequest -Uri "https://raw.githubusercontent.com/SteveMansfield/MNREAD-sentences/master/XMNREAD01.txt").Content;
$sentences = $sentences.Trim("`n").Split("`n") | foreach-object { $_.Substring(1, $_.Length - 3) };

$posts = $sentences `
    | select-object -First 200 `
    | foreach-object {
        new-object PSCustomObject -Property ([ordered] @{
            "Title"     = $_
            "SimTitles" = 0
        })
    };
# time the original implementation from the question
Measure-Command { Get-SimTitles $posts; }

# rebuild the test data, then time the word-set version
$posts = $sentences `
    | select-object -First 200 `
    | foreach-object {
        new-object PSCustomObject -Property ([ordered] @{
            "Title"     = $_
            "SimTitles" = 0
        })
    };

Measure-Command {
    $wordSets = @( ConvertTo-WordSets $Posts );
    Get-SimTitles $wordSets;
}
Size   Original   This one
----   --------   --------
  10      0.2        0.02
  20      0.4        0.1
  50      1.9        0.5
 100      8.7        1.9
 200     38          9
 500    246         82

(Times in seconds)

Upvotes: 1

MikeSh

Reputation: 432

You can halve the processing time by removing duplicate comparisons. I.e. once you have compared "title1" and "title2", you don't need to compare "title2" and "title1" - you already know the answer. So your inner loop should not start from the beginning of the array, as in the sketch below.
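
A minimal sketch of that indexing pattern, reusing the question's $Posts shape, 0.2 threshold, and Measure-TitleSimilarity function:

For ($i = 0; $i -lt $Posts.Count - 1; $i++) {
  # start the inner loop at $i + 1 so each unordered pair is scored exactly once
  For ($j = $i + 1; $j -lt $Posts.Count; $j++) {
    if ((Measure-TitleSimilarity $Posts[$i].title.Split(' ') $Posts[$j].title.Split(' ')) -gt .2) {
      $Posts[$i].SimTitles += 1
      $Posts[$j].SimTitles += 1
    }
  }
}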

Upvotes: 1
