Reputation: 53
I have an object $Posts which contain a title, and a SimTitles field amongst a few others. I need to compare each title to the other titles and give it a Similarity score in the SimTitles field. So if I have 80 $Posts, it will need to cover 6400 re-iterations as each title needs to be scored vs the others.
Apart from the Measure-TitleSimilarity routine which I believe is optimized, can anyone see a way to improve the speed of this double loop that I am missing?
Edit: I have included the function Measure-TitleSimilarity. I am actually passing the array to the function. The whole topic of quantifying arrays for likeness is fascinating. I have tried with Title.ToCharArray() which changes the magic number to a much higher number. It also can produce a match with two completely different titles as long as the characters are the same. (Ex: 'Mother Teresa' would closely match 'Earthmovers' or 'Thermometer' yet clearly not the same meaning). Cosine Similarity if just one method but it seemed easiest to process. @Mclayton and @bryancook - I see the light with your suggestion, but can't grasp tracking what no longer needs to be looked at for similar words.
Function Get-SimTitles([psobject]$NewPosts) {
$CKTitles = $NewPosts.title
foreach ($Ck in $CkTitles) {
$NewPosts | & {
process {
if ((Measure-TitleSimilarity $Ck.split(' ') $_.title.split(' ') -gt .2) {
$_.SimTitles = $_.SimTitles + 1
}
}
}
}
}
Function Measure-TitleSimilarity
{
## Based on VectorSimilarity by .AUTHOR Lee Holmes
## Modified slightly to match use
[CmdletBinding()]
param(
[Parameter(Position = 0)]
$Title1,
[Parameter(Position = 1)]
$Title2
)
$allkeys = @($Title1) + @($Title2) | Sort-Object -Unique
$set1Hash = @{}
$set2Hash = @{}
$setsToProcess = @($Title1, $Set1Hash), @($Title2, $Set2Hash)
foreach($set in $setsToProcess)
{
$set[0] | Foreach-Object {
$value = 1
$set[1][$_] = $value
}
}
$dot = 0
$mag1 = 0
$mag2 = 0
foreach($key in $allkeys)
{
$dot += $set1Hash[$key] * $set2Hash[$key]
$mag1 += ($set1Hash[$key] * $set1Hash[$key])
$mag2 += ($set2Hash[$key] * $set2Hash[$key])
}
$mag1 = [Math]::Sqrt($mag1)
$mag2 = [Math]::Sqrt($mag2)
return [Math]::Round($dot / ($mag1 * $mag2), 3)
}
Upvotes: 1
Views: 409
Reputation: 53
I tried other ways to measure the similarity of titles including word frequency vs frequency of words in a specific title. I think comparing the titles is a topic for a separate post though. I still like the idea of only looping once through.
@MikeSh - based on your answer this is what I came up with.
Function Get-SimTitles([psobject]$NewPosts) {
$i=0
$end = $NewPosts.Count - 1
For($i =0; $i -lt $end; $i++){
$k=$i+1
$k..$end | Where{{$NewPosts[$i].source -ne $NewPosts[$_].source}} |
Where-Object {(Measure-TitleSimilarity $NewPosts[$i].title.split(' ') $NewPosts[$_].title.split(' ')) -gt .35} |
& {process {$NewPosts[$_].SimTitles = $NewPosts[$_].SimTitles + 1; $NewPosts[$i].SimTitles+=1} }
}
}
Upvotes: 0
Reputation: 10125
Partial answer
This includes a couple of the suggestions from the comments:
Mathias R. Jessen - "You don't have to compare every title to every title - instead, you only need to compare all unique pairs"
my comment - "you could split your titles into word arrays once before you start comparing, and then loop over those, rather than splitting them every time"
$ErrorActionPreference = "Stop";
Set-StrictMode -Version "Latest";
function ConvertTo-WordSets( [psobject] $Posts )
{
# preprocess each post to break its title into word counts
# so we don't need to do it every time we compare 2 posts
foreach( $post in $Posts )
{
$set = new-object PSCustomObject -Property ([ordered] @{
"Post" = $post
"Title" = $post.Title.Trim()
"Words" = $null
"Counts" = $null
});
$set.Words = $set.Title.Split(" ");
$set.Counts = $set.Words `
| group-object `
| foreach-object `
-Begin { $counts = @{} } `
-Process { $counts.Add($_.Name, $_.Count) } `
-End { $counts };
write-output $set;
}
}
function Get-SimTitles( [psobject] $NewPosts )
{
# instead of comparing every object to every object, just compare unique combinations
# e.g. X compared to Y is the same as Y compared to X so score them both at the same time
# (and we don't need to compare an object to itself either)
for( $i = 0; $i -lt $NewPosts.Length; $i++ )
{
$left = $NewPosts[$i];
for( $j = $i + 1; $j -lt $NewPosts.Length; $j++ )
{
$right = $NewPosts[$j];
if ((Measure-TitleSimilarity2 $left $right) -gt .5)
{
$left.Post.SimTitles = $left.Post.SimTitles + 1;
$right.Post.SimTitles = $right.Post.SimTitles + 1;
}
}
}
}
Function Measure-TitleSimilarity
{
param
(
[Parameter(Position = 0)]
$Left,
[Parameter(Position = 1)]
$Right
)
# we can use the pre-processed word counts now
$allkeys = $Left.Words + $Right.Words | Sort-Object -Unique
$dot = 0
$mag1 = 0
$mag2 = 0
foreach($key in $allkeys)
{
$dot += $Left.Counts[$key] * $Right.Counts[$key]
$mag1 += $Left.Counts[$key] * $Left.Counts[$key]
$mag2 += $Right.Counts[$key] * $Right.Counts[$key]
}
$mag1 = [Math]::Sqrt($mag1)
$mag2 = [Math]::Sqrt($mag2)
return [Math]::Round($dot / ($mag1 * $mag2), 3)
}
Performance
Neither this nor the original are particularly fast for even moderately sized samples, but this one is about 4 times faster.
# get some test data
$sentences = (Invoke-WebRequest -Uri "https://raw.githubusercontent.com/SteveMansfield/MNREAD-sentences/master/XMNREAD01.txt").Content;
$sentences = $sentences.Trim("`n").Split("`n") | foreach-object { $_.Substring(1, $_.Length - 3) };
$posts = $sentences `
| select-object -First 200 `
| foreach-object {
new-object PSCustomObject -Property ([ordered] @{
"Title" = $_
"SimTitles" = 0
})
};
Measure-Command { Get-SimTitles $posts; }
# build some test data
$posts = $sentences `
| select-object -First 200 `
| foreach-object {
new-object PSCustomObject -Property ([ordered] @{
"Title" = $_
"SimTitles" = 0
})
};
Measure-Command {
$wordSets = @( ConvertTo-WordSets $Posts );
Get-SimTitles $wordSets;
}
Size | Original | This one |
---|---|---|
10 | 0.2 | 0.02 |
20 | 0.4 | 0.1 |
50 | 1.9 | 0.5 |
100 | 8.7 | 1.9 |
200 | 38 | 9 |
500 | 246 | 82 |
(Times in seconds)
Upvotes: 1
Reputation: 432
you can half the processing time by removing duplicate comparisons. I.e. once you compared "title1" and "title2", you don't need to compare "title2" and "title1" - you already know the answer. So, your inner loop should not start from the beginning of the array
Upvotes: 1