Reputation: 5766
I'm trying to figure out how to go about determining the most used words on a mysql dataset.
Not sure how to go about this or if there's a simpler approach. Read a couple posts where some suggests an algorithm.
Example:
From 24,500 records, find out the top 10 used words.
Upvotes: 7
Views: 3313
Reputation: 93
SELECT `COLUMNNAME`, COUNT(*) FROM `TABLENAME` GROUP BY `COLUMNNAME`
its very simple and worked... :)
Upvotes: 1
Reputation: 181
A little improve, remove stop words from the list with AND Sub2.aWord not in (list of stop words)
SELECT aWord, COUNT(*) AS WordOccuranceCount
FROM (SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(concat(txt_msg, ' '), ' ', aCnt), ' ', -1) AS aWord
FROM mensagens
CROSS JOIN (
SELECT a.i+b.i*10+c.i*100 + 1 AS aCnt
FROM integers a, integers b, integers c) Sub1
WHERE (LENGTH(txt_msg) + 1 - LENGTH(REPLACE(txt_msg, ' ', ''))) >= aCnt) Sub2
WHERE Sub2.aWord != '' AND Sub2.aWord not in ('a','about','above', .....)
GROUP BY aWord
ORDER BY WordOccuranceCount DESC
LIMIT 10
Upvotes: 0
Reputation: 3286
Why not do it all in PHP? Steps would be
I wouldn't do it in SQL mainly because it'd end up more complex.
Upvotes: 4
Reputation: 446
General idea would be to figure out how many delimiters (e.g. spaces) are in each field, and run SUBSTRING_INDEX()
in a loop, for each such field. Populating this into a temporary table has the added benefit of being able to run this in chunks, in parallel, etc. Shouldn't be too cumbersome to throw some SPs together to do this.
Upvotes: 1
Reputation: 21513
Right, this runs like a dog and is limited to working with a single delimiter, but hopefully will give you an idea.
SELECT aWord, COUNT(*) AS WordOccuranceCount
FROM (SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(concat(SomeColumn, ' '), ' ', aCnt), ' ', -1) AS aWord
FROM SomeTable
CROSS JOIN (
SELECT a.i+b.i*10+c.i*100 + 1 AS aCnt
FROM integers a, integers b, integers c) Sub1
WHERE (LENGTH(SomeColumn) + 1 - LENGTH(REPLACE(SomeColumn, ' ', ''))) >= aCnt) Sub2
WHERE Sub2.aWord != ''
GROUP BY aWord
ORDER BY WordOccuranceCount DESC
LIMIT 10
This relies on having a table called integers with a single column called i with 10 rows with the values 0 to 9. It copes with up to ~1000 words but can easily be altered to cope with more (but will slow down even more).
Upvotes: 14