Reputation: 6074
Running on a M1 Max macOS 12.6 with MariaDB 10.7 installed via homebrew.
I'm having trouble isolating the reason this is slower than I expected. I expected 2 to 10 minutes, but I'm seeing more than an hour. There are 57 million records, it seems to be scanning 55 million. At the bottom, I have listed the queries and the counts of each. I suspect the Using temporary; Using filesort. I have added an index on most referenced columns.
I am looking for more ideas to try to resolve this performance. If this is the best I can do, that is also a valid outcome. I am also looking at migrating to full text searching, but I am not ready for that yet.
ANALYZE results in table format:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | r_rows | filtered | r_filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | RPost | index_merge | rpost_idx_6, rpost_idx_7, rpost_idx_8 | rpost_idx_6, rpost_idx_7, rpost_idx_8 | 6,6,6 | NULL | 6176002 | 55104309.00 | 100.00 | 0.02 | Using intersect( rpost_idx_6, rpost_idx_7, rpost_idx_8); Using where; Using temporary; Using filesort |
11 | DEPENDENT SUBQUERY | RPost | eq_ref | PRIMARY | PRIMARY | 4 | test.RPost.Id | 1 | 1.00 | 100.00 | 100.00 | |
10 | DEPENDENT SUBQUERY | RPost | eq_ref | PRIMARY | PRIMARY | 4 | test.RPost.Id | 1 | 1.00 | 100.00 | 100.00 | |
9 | DEPENDENT SUBQUERY | RPost | eq_ref | PRIMARY | PRIMARY | 4 | test.RPost.Id | 1 | 1.00 | 100.00 | 100.00 | |
8 | DEPENDENT SUBQUERY | History | ref | posthistory_idx_1, history_idx_2, history_idx_4 | history_idx_1 | 4 | test.RPost.Id | 1 | 2.61 | 2.98 | 0.78 | Using where |
7 | DEPENDENT SUBQUERY | History | ref | posthistory_idx_1 ,history_idx_2, history_idx_4 | history_idx_1 | 4 | test.RPost.Id | 1 | 2.61 | 50 | 96.34 | Using where |
6 | DEPENDENT SUBQUERY | ALL | NULL | NULL | NULL | NULL | 2 | 2.00 | 100.00 | 0.59 | Using where | |
12 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
13 | UNION | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
NULL | UNION RESULT | <union12,13> | ALL | NULL | NULL | NULL | NULL | NULL | 2.00 | NULL | NULL | |
5 | DEPENDENT SUBQUERY | ALL | NULL | NULL | NULL | NULL | 2 | 2.00 | 100.00 | 0.12 | Using where | |
2 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
3 | UNION | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
NULL | UNION RESULT | <union2,3> | ALL | NULL | NULL | NULL | NULL | NULL | 2.00 | NULL | NULL |
The query:
WITH Common as (
SELECT '% strings %' AS Value
UNION
SELECT '% tofind %'
),
MyRPost (Id, STitle, SBody, Queued, Frozen)
AS
(
SELECT
RPost.Id,
(
SELECT count(*) FROM Common
WHERE RPost.Title Like Common.Value
) AS STitle,
(
SELECT count(*) FROM Common
WHERE RPost.Body Like Common.Value
) AS SBody,
(SELECT count(History.TypeId) FROM History
WHERE History.PostId = RPost.Id
AND History.TypeId BETWEEN 1 AND 9
)%2 AS Queued,
(SELECT count(History.TypeId) FROM History
WHERE History.PostId = RPost.Id
AND History.TypeId BETWEEN 10 AND 19
)%2 AS Frozen
FROM RPost
WHERE RPost.CompletedDate IS NULL
AND RPost.CancelDate is NULL
AND RPost.PausedDate IS NULL
)
SELECT STitle, SBody, Id, Queued, Frozen,
(SELECT RPost.ParentId FROM RPost WHERE MyRPost.Id=RPost.Id) as OriginId,
(SELECT RPost.PostTypeId FROM RPost WHERE MyRPost.Id=RPost.Id) as Type,
(SELECT RPost.Title FROM RPost WHERE MyRPost.Id=RPost.Id) as Title
FROM MyRPost
WHERE (Queued+Frozen) = 0
AND (STitle+SBody) > 1
ORDER BY (STitle+SBody), Id
Table Structure:
CREATE TABLE History (
Id INT NOT NULL PRIMARY KEY,
TypeId SMALLINT NOT NULL,
PostId INT NOT NULL
);
CREATE TABLE RPost (
Id INT NOT NULL PRIMARY KEY,
Title varchar(256),
Body text NULL,
PostTypeId TINYINT NOT NULL ,
ParentId INT,
CompletedDate DATETIME,
CancelDate DATETIME,
PausedDate DATETIME
);
Counts:
Query | Count |
---|---|
select * from rpost | 56 million |
remove non-NULL CompletedDate / CancelDate / PausedDate | 55 million |
now remove (Queued+Frozen) = 0 | 15 million |
finally process likes | 8505 |
Upvotes: 0
Views: 428
Reputation: 107587
Consider refactoring the query for conditional aggregation without the nested sub-selects:
WITH Common AS (
SELECT '% strings %' AS Value
UNION ALL
SELECT '% tofind %'
),
MyRPost (Id, STitle, SBody, Queued, Frozen) AS (
SELECT
r.Id,
COUNT(ct.Value) AS STitle,
COUNT(cb.Value) AS SBody,
SUM(h.TypeId BETWEEN 1 AND 9) % 2 AS Queued,
SUM(h.TypeId BETWEEN 10 AND 19) % 2 AS Frozen
FROM RPost r
LEFT JOIN Common ct ON r.Title LIKE ct.Value
LEFT JOIN Common cb ON r.Body LIKE cb.Value
LEFT JOIN History h ON h.PostId = r.Id
WHERE r.CompletedDate IS NULL
AND r.CancelDate is NULL
AND r.PausedDate IS NULL
GROUP BY r.Id
HAVING (Queued + Frozen) = 0
AND (STitle + SBody) > 1
)
SELECT
m.STitle, m.SBody, m.Id, m.Queued, m.Frozen,
r.ParentId AS OriginId, r.PostTypeId AS Type, r.Title
FROM MyRPost m
LEFT JOIN RPost r ON m.Id = r.Id
ORDER BY (m.STitle + m.SBody), m.Id
Upvotes: 1