Aurelio De Rosa
Aurelio De Rosa

Reputation: 22162

How to turn these N queries to one

I have a performance problem with a table I'm working on and it seems I'm unable to find a good solution to this issue. I create the due indexes but we have millions rows and query is still very slow.

The table represents a text splitted in token with other information for each token. I know that some of you may think this could have been done using a Full-text search engine but we can't. Please, trust me.

The table schema is the following:

CREATE TABLE `midia_lemmatized_text`
(
    `IdFile` CHAR(15) NOT NULL,
    `Position` INTEGER NOT NULL,
    `WordForm` VARCHAR(48) NOT NULL,
    `Pos` VARCHAR(16) NOT NULL,
    `Lemma` VARCHAR(64) NOT NULL,
    PRIMARY KEY (`IdFile`,`Position`),
    INDEX `midia_lemmatized_text_FI_2` (`Pos`),
    INDEX `midia_lemmatized_text_FI_3` (`WordForm`),
    CONSTRAINT `midia_lemmatized_text_FK_1`
        FOREIGN KEY (`IdFile`)
        REFERENCES `midia_metadata` (`Id`),
    CONSTRAINT `midia_lemmatized_text_FK_2`
        FOREIGN KEY (`Pos`)
        REFERENCES `midia_pos` (`Pos`)
) ENGINE=InnoDB CHARACTER SET='utf8';

where

Example of row:

1, 1, 'The', 'ART', 'The'
1, 2, 'table', 'NOUN', 'table'
1, 3, 'is', 'VER', 'be'
...

The query

The problematic query is like the following:

Find all the word forms that are "rivolgimento" in context, that is surrounded by the previous and following 10 words

Note: 10 can be another number and a context word is also a comma, dot and so on.

An example of the result is:

cuor trasparente , mi par bene di conchiuder con affettuoso rivolgimento alla dissimulazione stessa . O virtù , che sei il

What I do now is retrieve all the IdFile and Position numbers for each matched row and then loop over them to retrieve the previous and following N words. As you can understand, this implies 1 + N queries and for a large N it leads to a very slow response.

The main problem is also that people can also search using REGEX on the column and this slow even more the query.

I thought to use GROUP_CONCAT but don't know exactly how.

Upvotes: 0

Views: 58

Answers (2)

Aurelio De Rosa
Aurelio De Rosa

Reputation: 22162

Based on the hints provided by Gordon Linoff, I came out with this query:

select ltr.*, group_concat(lt.WordForm separator ' ')
from midia_lemmatized_text ltr join
     midia_lemmatized_text lt
     on ltr.`WordForm` = 'rivolgimento' and
        lt.position between ltr.position - 10 and ltr.position + 10 and
        lt.IdFile = ltr.IdFile
group by ltr.IdFile, ltr.position;

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1269843

The way that you would get 10 words around "rivolgimento":

select lt.*
from lemmatized_text ltr join
     lemmatized_text lt
     on ltr.lemma = 'rivolgimento' and
        lt.position between ltr.position - 10 and ltr.position + 10;

If you want the words in one row, per appearance of 'rivolgimento', then something like:

select ltr.position, group_concat(lt.lemma separator ' ')
from lemmatized_text ltr join
     lemmatized_text lt
     on ltr.lemma = 'rivolgimento' and
        lt.position between ltr.position - 10 and ltr.position + 10
group by ltr.position;

Upvotes: 1

Related Questions