Reputation: 22162
I have a performance problem with a table I'm working on and it seems I'm unable to find a good solution to this issue. I create the due indexes but we have millions rows and query is still very slow.
The table represents a text splitted in token with other information for each token. I know that some of you may think this could have been done using a Full-text search engine but we can't. Please, trust me.
The table schema is the following:
CREATE TABLE `midia_lemmatized_text`
(
`IdFile` CHAR(15) NOT NULL,
`Position` INTEGER NOT NULL,
`WordForm` VARCHAR(48) NOT NULL,
`Pos` VARCHAR(16) NOT NULL,
`Lemma` VARCHAR(64) NOT NULL,
PRIMARY KEY (`IdFile`,`Position`),
INDEX `midia_lemmatized_text_FI_2` (`Pos`),
INDEX `midia_lemmatized_text_FI_3` (`WordForm`),
CONSTRAINT `midia_lemmatized_text_FK_1`
FOREIGN KEY (`IdFile`)
REFERENCES `midia_metadata` (`Id`),
CONSTRAINT `midia_lemmatized_text_FK_2`
FOREIGN KEY (`Pos`)
REFERENCES `midia_pos` (`Pos`)
) ENGINE=InnoDB CHARACTER SET='utf8';
where
IdFile
is an external keyPosition
is a index position that specify the position of the current token inside the fileWordForm
is the token itselfPoS
is the Part-of-Speech of the word formLemma
is the lemma of the word formExample of row:
1, 1, 'The', 'ART', 'The'
1, 2, 'table', 'NOUN', 'table'
1, 3, 'is', 'VER', 'be'
...
The problematic query is like the following:
Find all the word forms that are "rivolgimento" in context, that is surrounded by the previous and following 10 words
Note: 10 can be another number and a context word is also a comma, dot and so on.
An example of the result is:
cuor trasparente , mi par bene di conchiuder con affettuoso rivolgimento alla dissimulazione stessa . O virtù , che sei il
What I do now is retrieve all the IdFile
and Position
numbers for each matched row and then loop over them to retrieve the previous and following N words. As you can understand, this implies 1 + N queries and for a large N it leads to a very slow response.
The main problem is also that people can also search using REGEX on the column and this slow even more the query.
I thought to use GROUP_CONCAT but don't know exactly how.
Upvotes: 0
Views: 58
Reputation: 22162
Based on the hints provided by Gordon Linoff, I came out with this query:
select ltr.*, group_concat(lt.WordForm separator ' ')
from midia_lemmatized_text ltr join
midia_lemmatized_text lt
on ltr.`WordForm` = 'rivolgimento' and
lt.position between ltr.position - 10 and ltr.position + 10 and
lt.IdFile = ltr.IdFile
group by ltr.IdFile, ltr.position;
Upvotes: 0
Reputation: 1269843
The way that you would get 10 words around "rivolgimento":
select lt.*
from lemmatized_text ltr join
lemmatized_text lt
on ltr.lemma = 'rivolgimento' and
lt.position between ltr.position - 10 and ltr.position + 10;
If you want the words in one row, per appearance of 'rivolgimento', then something like:
select ltr.position, group_concat(lt.lemma separator ' ')
from lemmatized_text ltr join
lemmatized_text lt
on ltr.lemma = 'rivolgimento' and
lt.position between ltr.position - 10 and ltr.position + 10
group by ltr.position;
Upvotes: 1