Get Non-Aggregate data in Aggregate query

Question

I am no SQL pro, but I believe I have solved my problem, albeit in a fairly non-performant way. I'm hoping someone can point to a better method than I've come up with. I'm trying to find duplicate or similar content in a term index, created by RelevanSSI (a full text search plugin for Wordpress) - however this is happening outside of the Wordpress installation and it's actual database so Wordpress, it's API's and any other tables normally associated with it are out of the scope of this.

The RelevanSSI Index table looks like this:

CREATE TABLE `wp_relevanssi` (
 `doc` bigint(20) NOT NULL DEFAULT '0',
 `term` varchar(50) NOT NULL DEFAULT '0',
 `content` mediumint(9) NOT NULL DEFAULT '0',
 `title` mediumint(9) NOT NULL DEFAULT '0',
 `comment` mediumint(9) NOT NULL DEFAULT '0',
 `tag` mediumint(9) NOT NULL DEFAULT '0',
 `link` mediumint(9) NOT NULL DEFAULT '0',
 `author` mediumint(9) NOT NULL DEFAULT '0',
 `category` mediumint(9) NOT NULL DEFAULT '0',
 `excerpt` mediumint(9) NOT NULL DEFAULT '0',
 `taxonomy` mediumint(9) NOT NULL DEFAULT '0',
 `customfield` mediumint(9) NOT NULL DEFAULT '0',
 `mysqlcolumn` mediumint(9) NOT NULL DEFAULT '0',
 `taxonomy_detail` longtext NOT NULL,
 `customfield_detail` longtext NOT NULL,
 `mysqlcolumn_detail` longtext NOT NULL,
 `type` varchar(210) NOT NULL DEFAULT 'post',
 `item` bigint(20) NOT NULL DEFAULT '0',
 `term_reverse` varchar(50) NOT NULL DEFAULT '0',
 UNIQUE KEY `doctermitem` (`doc`,`term`,`item`),
 KEY `terms` (`term`(20)),
 KEY `docs` (`doc`),
 KEY `typeitem` (`type`,`item`),
 KEY `relevanssi_term_reverse_idx` (`term_reverse`(10))
) ENGINE=InnoDB DEFAULT CHARSET=utf8

And I am successfully getting (I think) the info I want with the following query:

SELECT r1.doc, r2.doc, 
    50 * COUNT( r1.term ) * (
        (c1.total + c2.total) / 
        ( c1.total * c2.total ) 
    ) AS ScorePct
FROM  `wp_relevanssi` r1
LEFT JOIN  `wp_relevanssi` r2 
ON r1.term = r2.term
AND r1.doc > r2.doc
AND r1.type = r2.type
AND (r1.content > 0 or r1.title > 0 or r1.taxonomy > 0 or r1.tag > 0)
AND (r2.content > 0 or r2.title > 0 or r2.taxonomy > 0 or r2.tag > 0)
LEFT JOIN (
    SELECT doc, COUNT( term ) AS total
    FROM  `wp_relevanssi` 
    GROUP BY doc
) c1 
ON r1.doc = c1.doc
LEFT JOIN (
    SELECT doc, COUNT( term ) AS total
    FROM  `wp_relevanssi` 
    GROUP BY doc
) c2 
ON r2.doc = c2.doc
GROUP BY r1.doc, r2.doc
HAVING ScorePct >50
ORDER BY ScorePct DESC

My issue is those big ol' dodgy subqueries dropped into the joins. I think I need at least one subquery to do this (essentially, get the total amount of terms for a particular doc), because after that first LEFT JOIN we only have info about matching terms in the main query, having discarded the non-matching ones. (Please go ahead and tell me I'm wrong here, I'd love to find out a subquery is unnecessary).

Apart from that, is there a way for me to do this with a single subquery, or to otherwise improve the performance of this query? I totally expect it to be a very heavy query, I don't have any misgivings about that, but I'd like to get it running as nicely as possible.

EDIT: So I've just had to resolve this with a different approach - by looking at a single document at a time (as that document is changed) I can simplify the query to:

SELECT r1.doc, r2.doc, count(*) AS matches
FROM  `wp_relevanssi` r1
INNER JOIN  `wp_relevanssi` r2 
ON r1.term = r2.term
AND r1.doc <> r2.doc
AND r1.type = r2.type
AND (r1.content > 0 or r1.title > 0 or r1.taxonomy > 0 or r1.tag > 0)
AND (r2.content > 0 or r2.title > 0 or r2.taxonomy > 0 or r2.tag > 0)
WHERE r1.doc = %d
GROUP BY r1.doc, r2.doc
ORDER BY matches DESC
LIMIT 0,10

Which runs in a reasonable time even with 650,000 rows, and follow up with a :

SELECT doc, COUNT( term ) AS total
FROM  `wp_relevanssi` 
WHERE doc IN (%d,%d,%d...)
GROUP BY doc

then do the rest of the score match outside the DB.

Rick James · Accepted Answer

COUNT(term) implies that you need to test term for being NOT NULL. If not, then simply say COUNT(*).
Your LEFT JOINs seem to be identical; what gives? See below.
JOIN ( SELECT ... ) optimizes poorly when you have more than one of them.
LEFT implies that the 'table' on the 'right' might be missing rows, but you want NULLs in that case. Do you need that?
"Prefix" indexes (KEY terms (term(20))) are rarely beneficial, and often prevent the use of the index. Remove the (20).
InnoDB tables should have an explicit PRIMARY KEY. The UNIQUE key you have could be turned into it.
This query seems to be O(N*N). That is, it will rapidly (ie, quadratically) get slower as you increase the number of rows (N) in wp_relevanssi.

For the dup subquery, consider the following and use term_counts in the two places.

CREATE TABLE term_counts (
    PRIMARY KEY(doc)
)
    SELECT doc,
           COUNT( term ) AS total
        FROM  `wp_relevanssi` 
        GROUP BY doc;

Because of this

(r1.content > 0 or r1.title > 0 or r1.taxonomy > 0 or r1.tag > 0)

You should consider copying all rows that fail that filtering into another table, then work with that table.

Because of

ON r1.term = r2.term
AND r1.doc > r2.doc
AND r1.type = r2.type

I agree with

INDEX(term, type, doc)

(doc must be last, term and type can be in either order.)

Get Non-Aggregate data in Aggregate query

Answers (1)

Related Questions