Reputation: 8746
I am using Postgres's pg_trgm for finding similarity of alphanumeric text based on trigram matching. The text being searched on is a title column containing various titles of news articles.
SELECT set_limit(0.17);
SELECT
similarity(title, 'A Skating Club in Massachusetts Gathered to Grieve Members Killed in Plane Crash') AS similarity,
title
FROM
RECORD
WHERE title % 'A Skating Club in Massachusetts Gathered to Grieve Members Killed in Plane Crash'
ORDER BY
similarity DESC, created_at DESC;
This gives output:
similarity | title
------------+---------------------------------------------------------------------------------------------------
1 | A Skating Club in Massachusetts Gathered to Grieve Members Killed in Plane Crash
0.2371134 | Skaters killed in crash were Olympic hopefuls
0.21 | Plane Crash Clues, a New Way to Treat Pain, and More
0.20754717 | Plane crash in DC: What to know about the aircraft involved
0.2020202 | Homes in Massachusetts, Pennsylvania and Maryland
0.19626169 | N.T.S.B. Investigates Washington Plane Crash: What to Know
0.19512194 | Sen Katie Britt on DC plane crash: 'Nothing fills the void that is left here'
0.19402985 | Cutting Medicaid, Taxing Scholarships and Killing Invasive Plants: A Guide to the G.O.P. Wishlist
0.18803419 | Kansas lawmakers vow to find answers after Wichita plane crash in DC
0.17460318 | Olympic great fights through tears as she talks about victims in plane crash
0.17391305 | What We Know About the Plane Crash
0.17391305 | Top killer in US revealed, plus red meat warnings spark questions
(12 rows)
While it works decently well, some of the titles being returned make no sense.
For example, I have no idea why the title What We Know About the Plane Crash
has lower similarity
than the title Homes in Massachusetts, Pennsylvania and Maryland
.
I have no idea why the title What We Know About the Plane Crash
has the same similarity as Top killer in US revealed, plus red meat warnings spark questions
?
I don't know why Cutting Medicaid, Taxing Scholarships and Killing Invasive Plants: A Guide to the G.O.P. Wishlist
has more similarity than the 3 titles below it: Kansas lawmakers vow to find answers after Wichita plane crash in DC
, Olympic great fights through tears as she talks about victims in plane crash
, and What We Know About the Plane Crash
.
Also if I test this SQL:
SELECT title FROM record ORDER BY title <-> 'A Skating Club in Massachusetts Gathered to Grieve Members Killed in Plane Crash' limit 25;
I get:
title
-------------------------------------------------------------------------------------------------------------
A Skating Club in Massachusetts Gathered to Grieve Members Killed in Plane Crash
Skaters killed in crash were Olympic hopefuls
Plane Crash Clues, a New Way to Treat Pain, and More
Plane crash in DC: What to know about the aircraft involved
Homes in Massachusetts, Pennsylvania and Maryland
N.T.S.B. Investigates Washington Plane Crash: What to Know
Sen Katie Britt on DC plane crash: 'Nothing fills the void that is left here'
Cutting Medicaid, Taxing Scholarships and Killing Invasive Plants: A Guide to the G.O.P. Wishlist
Kansas lawmakers vow to find answers after Wichita plane crash in DC
Olympic great fights through tears as she talks about victims in plane crash
What We Know About the Plane Crash
Top killer in US revealed, plus red meat warnings spark questions
Two planes aborted landings at DCA due to helicopters in flight path in week before crash: report
An Iraqi man who carried out Quran burnings in Sweden is killed in a shooting
Migrant TDA gang member breaks officer’s arm as 10 indicted in massive guns, drug running operation: police
FAA makes startling discovery about control tower at time of deadly plane crash
Olympic gold medalist says DC plane crash hits home as an athlete: 'Could have been me'
Officials recover key evidence in deadly DC plane crash and more top headlines
DC airline tragedy recalls two devastating plane crashes that rocked sports world
Where’s the Compassion? Trump’s Use of the Crash to Attack D.E.I.
Amber Alert issued for sisters allegedly taken by father suspected of killing their mother
Hegseth vows to deliver answers on DC plane crash: ‘Completely unacceptable’
‘Love Me’: A Romance Six Billion Years in the Making
DC plane crash air traffic control audio reveals moment controllers saw disaster: 'Tower did you see that?'
41 Bodies Recovered as Investigators Seek Answers in Washington Crash
(25 rows)
I have no idea why ‘Love Me’: A Romance Six Billion Years in the Making
is ranking higher than DC plane crash air traffic control audio reveals moment controllers saw disaster: 'Tower did you see that?'
and 41 Bodies Recovered as Investigators Seek Answers in Washington Crash
?
Is there a way to improve this further?
Upvotes: 1
Views: 50
Reputation: 26322
For example, I have no idea why the title [A] has lower similarity than the title [B].
I have no idea why the title [C] has the same similarity as [D]?
I don't know why [E] has more similarity than the 3 titles below it: [F], [G], and [H].
I have no idea why [I] is ranking higher than [J] and [K]?
Trigram comparisons in pg_trgm
don't have a notion of conceptual or even word-level similarity, it really just explodes the text into a bunch of overlapping 3-character groups and checks how many of them are shared between the operands, compared to how long they are.
The perceived similarity in titles this long does affect the results but not to an extent you might've hoped for - it pretty much drowns in noise. On trigram level, these are about as dissimilar as it gets and lowering the match threshold/similarity limit this far is asking for false positives.
Refer to the example at the end and the demo to see how pg_trgm
really interprets your values.
Is there a way to improve this further?
Combine this with Full Text Search and fuzzystrmatch
. The pg_trgm
doc even mentions this as a recommendation:
F.33.5. Text Search Integration
Trigram matching is a very useful tool when used in conjunction with a full text index. In particular it can help to recognize misspelled input words that will not be matched directly by the full text search mechanism.
The first step is to generate an auxiliary table containing all the unique words in the documents:
CREATE TABLE words AS SELECT word FROM ts_stat('SELECT to_tsvector(''simple'', bodytext) FROM documents');
where documents is a table that has a text field bodytext that we wish to search. The reason for using the simple configuration with the to_tsvector function, instead of using a language-specific configuration, is that we want a list of the original (unstemmed) words.
Next, create a trigram index on the word column:
CREATE INDEX words_idx ON words USING GIN (word gin_trgm_ops);
Now, a
SELECT
query similar to the previous example can be used to suggest spellings for misspelled words in user search terms. A useful extra test is to require that the selected words are also of similar length to the misspelled word.
Notice how suddenly the ones with skating and a plane crash score highest:
select row_number()over(order by ts_rank( to_tsvector('english',title)
, plainto_tsquery('english',target_title)
, 2) desc) as place
, ts_headline('english'
,title
,plainto_tsquery('english',target_title)
,'HighlightAll=true, StartSel=**, StopSel=**')
, ts_rank( to_tsvector('english',title)
, plainto_tsquery('english',target_title)
, 2)
from test
cross join (values($txt$A Skating Club in Massachusetts Gathered to Grieve Members Killed in Plane Crash$txt$)
)as a(target_title)
order by place;
It's also visible how the words got stemmed to match multiple variants of a word
A Skating Club in Massachusetts Gathered to Grieve Members Killed in Plane Crash
place | ts_headline | ts_rank |
---|---|---|
1 | A Skating Club in Massachusetts Gathered to Grieve Members Killed in Plane Crash | 0.10667805 |
2 | What We Know About the Plane Crash | 0.033034407 |
3 | Skaters killed in crash were Olympic hopefuls | 0.019700171 |
4 | Plane crash in DC: What to know about the aircraft involved | 0.016517203 |
5 | N.T.S.B. Investigates Washington Plane Crash: What to Know | 0.016517203 |
6 | Plane Crash Clues, a New Way to Treat Pain, and More | 0.014157603 |
7 | Olympic great fights through tears as she talks about victims in plane crash | 0.0123879025 |
8 | Hegseth vows to deliver answers on DC plane crash: ‘Completely unacceptable’ | 0.011011469 |
9 | Kansas lawmakers vow to find answers after Wichita plane crash in DC | 0.011011469 |
10 | FAA makes startling discovery about control tower at time of deadly plane crash | 0.009910322 |
11 | Sen Katie Britt on DC plane crash: 'Nothing fills the void that is left here' | 0.009910322 |
12 | Officials recover key evidence in deadly DC plane crash and more top headlines | 0.009910322 |
13 | Olympic gold medalist says DC plane crash hits home as an athlete: 'Could have been me' | 0.009009384 |
14 | DC airline tragedy recalls two devastating plane crashes that rocked sports world | 0.009009384 |
15 | DC plane crash air traffic control audio reveals moment controllers saw disaster: 'Tower did you see that?' | 0.0070788013 |
16 | Two planes aborted landings at DCA due to helicopters in flight path in week before crash: report | 0.00094642764 |
17 | Homes in Massachusetts, Pennsylvania and Maryland | 2.5e-21 |
You can show_trgm()
and see exactly what's being compared, similar to how you can inspect a tsvector
. Although subtracting/intersecting the trigram arrays isn't (all) that's happening under the hood, passing that into array functions and operators can give you some abstract estimation:
demo at db<>fiddle
SELECT similarity(title, target_title)
, title
, cardinality(show_trgm(title)) as len
, cardinality(trgm_difference) as diff
, cardinality(trgm_common) as common
, trgm_common
, trgm_difference
FROM test
cross join (values($txt$A Skating Club in Massachusetts Gathered to Grieve Members Killed in Plane Crash$txt$)
)as a(target_title)
cross join lateral(values
(array(select unnest(show_trgm(title))
except
select unnest(show_trgm(target_title)) ))
)as b(trgm_difference)
cross join lateral(values
(array(select unnest(show_trgm(title))
intersect
select unnest(show_trgm(target_title)) ))
)as c(trgm_common)
WHERE title % target_title
ORDER BY similarity DESC, created_at DESC;
similarity | title | len | diff | common | trgm_common | trgm_difference |
---|---|---|---|---|---|---|
1 | A Skating Club in Massachusetts Gathered to Grieve Members Killed in Plane Crash | 74 | 0 | 74 | {tts,gat,"ng ",gri,"sh ",kil," me"," cr",ett," in",rie,ing,"ub ",ska,emb,lub,"rs "," cl"," k",ass," to",iev,sac," g","to ","in ",use,mbe,red,ane," t",ers,lle," s"," m"," p",set,eve,ash,mem," c",ssa,lan,her,ras,ach,"ed "," sk",ere,"ve ",hus,ber," gr"," ki"," a",pla,tin," i",ati," ga",cra,chu," ma"," a ",the,"ne ",kat,ath,mas,ill,led,"ts ",clu," pl"} | {} |
0.2371134 | Skaters killed in crash were Olympic hopefuls | 46 | 23 | 23 | {"sh ",kil," cr"," in",ska,"rs "," k","in ",ers,lle," s",ash," c",ras,"ed "," sk",ere," ki"," i",cra,kat,ill,led} | {ate,wer,uls," h","re ",ful,efu,pic,"ls ","ic ",hop,ope,lym," we",ymp,mpi,pef,oly," ol",ter," w"," o"," ho"} |
0.21 | Plane Crash Clues, a New Way to Treat Pain, and More | 47 | 26 | 21 | {"sh "," cr"," cl"," to","to ","in ",ane," t"," m"," p",ash," c",lan,ras," a",pla,cra," a ","ne ",clu," pl"} | {" pa",mor,"nd ",ain,"re ",eat,ues," n",rea,"ay "," wa"," tr",lue," an",way," mo",ore,"at ",pai,"ew ","es ",and,new," w",tre," ne"} |
0.20754717 | Plane crash in DC: What to know about the aircraft involved | 54 | 32 | 22 | {"sh "," cr"," in"," k"," to","to ","in ",ane," t"," p",ash," c",lan,ras,"ed "," a",pla," i",cra,the,"ne "," pl"} | {irc,wha,aft," kn",ved,abo," wh"," dc","ut "," th",nvo,rcr," ai","ft ",hat,olv,"at ",air,raf,"he ",now,kno,bou," d","dc ",vol," ab"," w",lve,"ow ",out,inv} |
0.2020202 | Homes in Massachusetts, Pennsylvania and Maryland | 45 | 25 | 20 | {ett," in",ass,sac,"in ",use," m"," p",set,ssa,lan,ach,hus," a"," i",chu," ma",mas,"ts ",tts} | {ylv,pen,"ia ",mes,hom,"nd "," h",van,nns,ryl,ome,ani," an",ary,mar,nia,nsy,lva,enn,yla,"es ",and,syl," pe"," ho"} |
0.19626169 | N.T.S.B. Investigates Washington Plane Crash: What to Know | 54 | 33 | 21 | {"sh "," cr"," in",ing," k"," to","to ",ane," t"," s"," p",ash," c",lan,ras,pla," i",cra,"ne "," pl",gat} | {est,ate," t ",tig,ngt,wha,"on "," kn"," wh",gto," n"," wa",nve," s ",iga,hat,"at "," b ",ton,tes,now," n "," b",shi,kno,"es ",sti,ves," w","ow ",hin,was,inv} |
0.19512194 | Sen Katie Britt on DC plane crash: 'Nothing fills the void that is left here' | 73 | 49 | 24 | {"ng ","sh "," cr",ing," k",ane," t"," s"," p",ash," c",lan,her,ras,ere,pla," i",ati,cra,the,"ne ",kat,ill," pl"} | {" v"," se"," is","ie ","on ",fil," h",itt,"re "," vo"," ka",tha,not," dc",sen," n","id ",lls," th"," f","ls "," br"," he","ft ",hat,tie,lef,eft,"at ",rit," fi",bri,"he "," b",oid," le","tt "," d","dc "," no",voi,"is ","en ",oth,thi," o"," on",hin," l"} |
0.19402985 | Cutting Medicaid, Taxing Scholarships and Killing Invasive Plants: A Guide to the G.O.P. Wishlist | 86 | 60 | 26 | {"ng ",kil," me"," in",ing," k"," to"," g","to "," t"," s"," m"," p"," c",lan,"ve "," ki"," a",pla,tin," i"," a ",the,ill,"ts "," pl"} | {inv,aid,dic,uid,ola,lin,axi,utt,ist,"nd ",ica,asi,nts,ive,lli,"ps ",hol,cai,"id "," th",hli," sc",nva," an",hip,gui,ish,tti," g ",sch,xin,shl,wis,cho," o ",ars,med,cut,"he "," p "," cu",siv,lar,"de "," wi",shi,edi,rsh,ips,tax,"st ",vas,and,ide,ant,lis," ta"," w"," gu"," o"} |
0.18803419 | Kansas lawmakers vow to find answers after Wichita plane crash in DC | 65 | 43 | 22 | {"sh "," cr"," in","rs "," k"," to","to ","in ",ane," t",ers," p",ash," c",lan,ras," a",pla," i",cra,"ne "," pl"} | {" v",wma,"ta ","er ",nsa,wer,ake,"nd ",swe,vow,ker,mak," vo",aft," ka",nsw," dc",wic," f",hit," an",fte,law,awm,kan," fi"," la",ind," wi"," d","dc ",ter,ich,"as ",fin," af"," w",chi,sas,"ow ",ita,ans," l"} |
0.17460318 | Olympic great fights through tears as she talks about victims in plane crash | 74 | 52 | 22 | {"sh "," cr"," in","rs "," g","in ",ane," t"," s"," p",ash," c",lan,ras," gr"," a",pla," i",cra,"ne ","ts "," pl"} | {" v"," as","ms ",alk,eat,abo,"ut ",tal,oug,rea,pic,hro,fig," th",vic,igh," f",hts,"ic ",she,lym," vi",thr,ymp,mpi,"at ",tim," fi"," te"," sh",gre,cti,ars,oly,ugh,"he ",rou," ol","ks ",ict,"gh ",bou,lks,ear,ims,"as ",ght," ta"," ab"," o",out,tea} |
0.17391305 | What We Know About the Plane Crash | 34 | 18 | 16 | {"sh "," cr"," k",ane," t"," p",ash," c",lan,ras," a",pla,cra,the,"ne "," pl"} | {wha," kn",abo," wh","ut "," th",hat," we","at ","we ","he ",now,kno,bou," ab"," w","ow ",out} |
0.17391305 | Top killer in US revealed, plus red meat warnings spark questions | 61 | 41 | 20 | {kil," me"," in",ing," k"," to","in ",red," t",lle," s"," m"," p",eve,"ed "," ki"," i",ill,led," pl"} | {est,rev,ark,war,"er ",ale,lus,"us ","ns ",eat,ion,ngs,"gs ",ues," sp",spa," wa",tio,par," re"," u","at ",ler,"rk ",arn,plu,nin," qu",que,top,rni," q",mea," r",sti,"op "," w",vea," us",eal,ons} |
Upvotes: 2