How to improve Postgres pg_trgm for text similarity and make more similar text rank higher?

I am using Postgres's pg_trgm for finding similarity of alphanumeric text based on trigram matching. The text being searched on is a title column containing various titles of news articles.

SELECT set_limit(0.17);

SELECT
    similarity(title, 'A Skating Club in Massachusetts Gathered to Grieve Members Killed in Plane Crash') AS similarity,
    title
FROM
    RECORD
WHERE title % 'A Skating Club in Massachusetts Gathered to Grieve Members Killed in Plane Crash'
ORDER BY
    similarity DESC, created_at DESC;

This gives output:

 similarity |                                               title                                               
------------+---------------------------------------------------------------------------------------------------
          1 | A Skating Club in Massachusetts Gathered to Grieve Members Killed in Plane Crash
  0.2371134 | Skaters killed in crash were Olympic hopefuls
       0.21 | Plane Crash Clues, a New Way to Treat Pain, and More
 0.20754717 | Plane crash in DC: What to know about the aircraft involved
  0.2020202 | Homes in Massachusetts, Pennsylvania and Maryland
 0.19626169 | N.T.S.B. Investigates Washington Plane Crash: What to Know
 0.19512194 | Sen Katie Britt on DC plane crash: 'Nothing fills the void that is left here'
 0.19402985 | Cutting Medicaid, Taxing Scholarships and Killing Invasive Plants: A Guide to the G.O.P. Wishlist
 0.18803419 | Kansas lawmakers vow to find answers after Wichita plane crash in DC
 0.17460318 | Olympic great fights through tears as she talks about victims in plane crash
 0.17391305 | What We Know About the Plane Crash
 0.17391305 | Top killer in US revealed, plus red meat warnings spark questions
(12 rows)

While it works decently well, some of the titles being returned make no sense.

For example, I have no idea why the title What We Know About the Plane Crash has lower similarity than the title Homes in Massachusetts, Pennsylvania and Maryland.

I have no idea why the title What We Know About the Plane Crash has the same similarity as Top killer in US revealed, plus red meat warnings spark questions?

I don't know why Cutting Medicaid, Taxing Scholarships and Killing Invasive Plants: A Guide to the G.O.P. Wishlist has more similarity than the 3 titles below it: Kansas lawmakers vow to find answers after Wichita plane crash in DC, Olympic great fights through tears as she talks about victims in plane crash, and What We Know About the Plane Crash.

Also if I test this SQL:

SELECT title FROM record ORDER BY title <-> 'A Skating Club in Massachusetts Gathered to Grieve Members Killed in Plane Crash' limit 25;

I get:

                                                    title                                                    
-------------------------------------------------------------------------------------------------------------
 A Skating Club in Massachusetts Gathered to Grieve Members Killed in Plane Crash
 Skaters killed in crash were Olympic hopefuls
 Plane Crash Clues, a New Way to Treat Pain, and More
 Plane crash in DC: What to know about the aircraft involved
 Homes in Massachusetts, Pennsylvania and Maryland
 N.T.S.B. Investigates Washington Plane Crash: What to Know
 Sen Katie Britt on DC plane crash: 'Nothing fills the void that is left here'
 Cutting Medicaid, Taxing Scholarships and Killing Invasive Plants: A Guide to the G.O.P. Wishlist
 Kansas lawmakers vow to find answers after Wichita plane crash in DC
 Olympic great fights through tears as she talks about victims in plane crash
 What We Know About the Plane Crash
 Top killer in US revealed, plus red meat warnings spark questions
 Two planes aborted landings at DCA due to helicopters in flight path in week before crash: report
 An Iraqi man who carried out Quran burnings in Sweden is killed in a shooting
 Migrant TDA gang member breaks officer’s arm as 10 indicted in massive guns, drug running operation: police
 FAA makes startling discovery about control tower at time of deadly plane crash
 Olympic gold medalist says DC plane crash hits home as an athlete: 'Could have been me'
 Officials recover key evidence in deadly DC plane crash and more top headlines
 DC airline tragedy recalls two devastating plane crashes that rocked sports world
 Where’s the Compassion? Trump’s Use of the Crash to Attack D.E.I.
 Amber Alert issued for sisters allegedly taken by father suspected of killing their mother
 Hegseth vows to deliver answers on DC plane crash: ‘Completely unacceptable’
 ‘Love Me’: A Romance Six Billion Years in the Making
 DC plane crash air traffic control audio reveals moment controllers saw disaster: 'Tower did you see that?'
 41 Bodies Recovered as Investigators Seek Answers in Washington Crash
(25 rows)

I have no idea why ‘Love Me’: A Romance Six Billion Years in the Making is ranking higher than DC plane crash air traffic control audio reveals moment controllers saw disaster: 'Tower did you see that?' and 41 Bodies Recovered as Investigators Seek Answers in Washington Crash?

Is there a way to improve this further?

Upvotes: 1

Views: 50

Answers (1)

Zegarek
Zegarek

Reputation: 26322

For example, I have no idea why the title [A] has lower similarity than the title [B].
I have no idea why the title [C] has the same similarity as [D]?
I don't know why [E] has more similarity than the 3 titles below it: [F], [G], and [H].
I have no idea why [I] is ranking higher than [J] and [K]?

Trigram comparisons in pg_trgm don't have a notion of conceptual or even word-level similarity, it really just explodes the text into a bunch of overlapping 3-character groups and checks how many of them are shared between the operands, compared to how long they are.

The perceived similarity in titles this long does affect the results but not to an extent you might've hoped for - it pretty much drowns in noise. On trigram level, these are about as dissimilar as it gets and lowering the match threshold/similarity limit this far is asking for false positives.
Refer to the example at the end and the demo to see how pg_trgm really interprets your values.

Is there a way to improve this further?

Combine this with Full Text Search and fuzzystrmatch. The pg_trgm doc even mentions this as a recommendation:

F.33.5. Text Search Integration

Trigram matching is a very useful tool when used in conjunction with a full text index. In particular it can help to recognize misspelled input words that will not be matched directly by the full text search mechanism.

The first step is to generate an auxiliary table containing all the unique words in the documents:

CREATE TABLE words AS SELECT word FROM
    ts_stat('SELECT to_tsvector(''simple'', bodytext) FROM documents');

where documents is a table that has a text field bodytext that we wish to search. The reason for using the simple configuration with the to_tsvector function, instead of using a language-specific configuration, is that we want a list of the original (unstemmed) words.

Next, create a trigram index on the word column:

CREATE INDEX words_idx ON words USING GIN (word gin_trgm_ops);

Now, a SELECT query similar to the previous example can be used to suggest spellings for misspelled words in user search terms. A useful extra test is to require that the selected words are also of similar length to the misspelled word.

Notice how suddenly the ones with skating and a plane crash score highest:

select row_number()over(order by ts_rank(  to_tsvector('english',title)
                                         , plainto_tsquery('english',target_title)
                                         , 2) desc) as place
     , ts_headline('english'
                   ,title
                   ,plainto_tsquery('english',target_title)
                   ,'HighlightAll=true, StartSel=**, StopSel=**')
     , ts_rank(  to_tsvector('english',title)
               , plainto_tsquery('english',target_title)
               , 2)
from test
cross join (values($txt$A Skating Club in Massachusetts Gathered to Grieve Members Killed in Plane Crash$txt$)
  )as a(target_title)
order by place;

It's also visible how the words got stemmed to match multiple variants of a word

A Skating Club in Massachusetts Gathered to Grieve Members Killed in Plane Crash

place ts_headline ts_rank
1 A Skating Club in Massachusetts Gathered to Grieve Members Killed in Plane Crash 0.10667805
2 What We Know About the Plane Crash 0.033034407
3 Skaters killed in crash were Olympic hopefuls 0.019700171
4 Plane crash in DC: What to know about the aircraft involved 0.016517203
5 N.T.S.B. Investigates Washington Plane Crash: What to Know 0.016517203
6 Plane Crash Clues, a New Way to Treat Pain, and More 0.014157603
7 Olympic great fights through tears as she talks about victims in plane crash 0.0123879025
8 Hegseth vows to deliver answers on DC plane crash: ‘Completely unacceptable’ 0.011011469
9 Kansas lawmakers vow to find answers after Wichita plane crash in DC 0.011011469
10 FAA makes startling discovery about control tower at time of deadly plane crash 0.009910322
11 Sen Katie Britt on DC plane crash: 'Nothing fills the void that is left here' 0.009910322
12 Officials recover key evidence in deadly DC plane crash and more top headlines 0.009910322
13 Olympic gold medalist says DC plane crash hits home as an athlete: 'Could have been me' 0.009009384
14 DC airline tragedy recalls two devastating plane crashes that rocked sports world 0.009009384
15 DC plane crash air traffic control audio reveals moment controllers saw disaster: 'Tower did you see that?' 0.0070788013
16 Two planes aborted landings at DCA due to helicopters in flight path in week before crash: report 0.00094642764
17 Homes in Massachusetts, Pennsylvania and Maryland 2.5e-21

You can show_trgm() and see exactly what's being compared, similar to how you can inspect a tsvector. Although subtracting/intersecting the trigram arrays isn't (all) that's happening under the hood, passing that into array functions and operators can give you some abstract estimation:
demo at db<>fiddle

SELECT similarity(title, target_title)
     , title
     , cardinality(show_trgm(title)) as len
     , cardinality(trgm_difference) as diff
     , cardinality(trgm_common) as common
     , trgm_common
     , trgm_difference
FROM test
cross join (values($txt$A Skating Club in Massachusetts Gathered to Grieve Members Killed in Plane Crash$txt$)
)as a(target_title)
cross join lateral(values
  (array(select unnest(show_trgm(title))
         except 
         select unnest(show_trgm(target_title)) ))
)as b(trgm_difference)
cross join lateral(values
  (array(select unnest(show_trgm(title))
         intersect 
         select unnest(show_trgm(target_title)) ))
)as c(trgm_common)
WHERE title % target_title
ORDER BY similarity DESC, created_at DESC;
similarity title len diff common trgm_common trgm_difference
1 A Skating Club in Massachusetts Gathered to Grieve Members Killed in Plane Crash 74 0 74 {tts,gat,"ng ",gri,"sh ",kil," me"," cr",ett," in",rie,ing,"ub ",ska,emb,lub,"rs "," cl"," k",ass," to",iev,sac," g","to ","in ",use,mbe,red,ane," t",ers,lle," s"," m"," p",set,eve,ash,mem," c",ssa,lan,her,ras,ach,"ed "," sk",ere,"ve ",hus,ber," gr"," ki"," a",pla,tin," i",ati," ga",cra,chu," ma"," a ",the,"ne ",kat,ath,mas,ill,led,"ts ",clu," pl"} {}
0.2371134 Skaters killed in crash were Olympic hopefuls 46 23 23 {"sh ",kil," cr"," in",ska,"rs "," k","in ",ers,lle," s",ash," c",ras,"ed "," sk",ere," ki"," i",cra,kat,ill,led} {ate,wer,uls," h","re ",ful,efu,pic,"ls ","ic ",hop,ope,lym," we",ymp,mpi,pef,oly," ol",ter," w"," o"," ho"}
0.21 Plane Crash Clues, a New Way to Treat Pain, and More 47 26 21 {"sh "," cr"," cl"," to","to ","in ",ane," t"," m"," p",ash," c",lan,ras," a",pla,cra," a ","ne ",clu," pl"} {" pa",mor,"nd ",ain,"re ",eat,ues," n",rea,"ay "," wa"," tr",lue," an",way," mo",ore,"at ",pai,"ew ","es ",and,new," w",tre," ne"}
0.20754717 Plane crash in DC: What to know about the aircraft involved 54 32 22 {"sh "," cr"," in"," k"," to","to ","in ",ane," t"," p",ash," c",lan,ras,"ed "," a",pla," i",cra,the,"ne "," pl"} {irc,wha,aft," kn",ved,abo," wh"," dc","ut "," th",nvo,rcr," ai","ft ",hat,olv,"at ",air,raf,"he ",now,kno,bou," d","dc ",vol," ab"," w",lve,"ow ",out,inv}
0.2020202 Homes in Massachusetts, Pennsylvania and Maryland 45 25 20 {ett," in",ass,sac,"in ",use," m"," p",set,ssa,lan,ach,hus," a"," i",chu," ma",mas,"ts ",tts} {ylv,pen,"ia ",mes,hom,"nd "," h",van,nns,ryl,ome,ani," an",ary,mar,nia,nsy,lva,enn,yla,"es ",and,syl," pe"," ho"}
0.19626169 N.T.S.B. Investigates Washington Plane Crash: What to Know 54 33 21 {"sh "," cr"," in",ing," k"," to","to ",ane," t"," s"," p",ash," c",lan,ras,pla," i",cra,"ne "," pl",gat} {est,ate," t ",tig,ngt,wha,"on "," kn"," wh",gto," n"," wa",nve," s ",iga,hat,"at "," b ",ton,tes,now," n "," b",shi,kno,"es ",sti,ves," w","ow ",hin,was,inv}
0.19512194 Sen Katie Britt on DC plane crash: 'Nothing fills the void that is left here' 73 49 24 {"ng ","sh "," cr",ing," k",ane," t"," s"," p",ash," c",lan,her,ras,ere,pla," i",ati,cra,the,"ne ",kat,ill," pl"} {" v"," se"," is","ie ","on ",fil," h",itt,"re "," vo"," ka",tha,not," dc",sen," n","id ",lls," th"," f","ls "," br"," he","ft ",hat,tie,lef,eft,"at ",rit," fi",bri,"he "," b",oid," le","tt "," d","dc "," no",voi,"is ","en ",oth,thi," o"," on",hin," l"}
0.19402985 Cutting Medicaid, Taxing Scholarships and Killing Invasive Plants: A Guide to the G.O.P. Wishlist 86 60 26 {"ng ",kil," me"," in",ing," k"," to"," g","to "," t"," s"," m"," p"," c",lan,"ve "," ki"," a",pla,tin," i"," a ",the,ill,"ts "," pl"} {inv,aid,dic,uid,ola,lin,axi,utt,ist,"nd ",ica,asi,nts,ive,lli,"ps ",hol,cai,"id "," th",hli," sc",nva," an",hip,gui,ish,tti," g ",sch,xin,shl,wis,cho," o ",ars,med,cut,"he "," p "," cu",siv,lar,"de "," wi",shi,edi,rsh,ips,tax,"st ",vas,and,ide,ant,lis," ta"," w"," gu"," o"}
0.18803419 Kansas lawmakers vow to find answers after Wichita plane crash in DC 65 43 22 {"sh "," cr"," in","rs "," k"," to","to ","in ",ane," t",ers," p",ash," c",lan,ras," a",pla," i",cra,"ne "," pl"} {" v",wma,"ta ","er ",nsa,wer,ake,"nd ",swe,vow,ker,mak," vo",aft," ka",nsw," dc",wic," f",hit," an",fte,law,awm,kan," fi"," la",ind," wi"," d","dc ",ter,ich,"as ",fin," af"," w",chi,sas,"ow ",ita,ans," l"}
0.17460318 Olympic great fights through tears as she talks about victims in plane crash 74 52 22 {"sh "," cr"," in","rs "," g","in ",ane," t"," s"," p",ash," c",lan,ras," gr"," a",pla," i",cra,"ne ","ts "," pl"} {" v"," as","ms ",alk,eat,abo,"ut ",tal,oug,rea,pic,hro,fig," th",vic,igh," f",hts,"ic ",she,lym," vi",thr,ymp,mpi,"at ",tim," fi"," te"," sh",gre,cti,ars,oly,ugh,"he ",rou," ol","ks ",ict,"gh ",bou,lks,ear,ims,"as ",ght," ta"," ab"," o",out,tea}
0.17391305 What We Know About the Plane Crash 34 18 16 {"sh "," cr"," k",ane," t"," p",ash," c",lan,ras," a",pla,cra,the,"ne "," pl"} {wha," kn",abo," wh","ut "," th",hat," we","at ","we ","he ",now,kno,bou," ab"," w","ow ",out}
0.17391305 Top killer in US revealed, plus red meat warnings spark questions 61 41 20 {kil," me"," in",ing," k"," to","in ",red," t",lle," s"," m"," p",eve,"ed "," ki"," i",ill,led," pl"} {est,rev,ark,war,"er ",ale,lus,"us ","ns ",eat,ion,ngs,"gs ",ues," sp",spa," wa",tio,par," re"," u","at ",ler,"rk ",arn,plu,nin," qu",que,top,rni," q",mea," r",sti,"op "," w",vea," us",eal,ons}

Upvotes: 2

Related Questions