Reputation: 13
I'm trying to count the most common words from a table full of text (strings) in a MySQL database (using MYSQL workbench). I got this code working from reading another post (written by Kickstart). This code uses a separate table called integer with 10 columns from 0 to 9 for counting.
Table Schema for the main table. I'm mainly only interested in data the "Text" column.
'Id', 'int(11)', 'NO', 'PRI', '0', ''
'PostId', 'int(11)', 'YES', 'MUL', NULL, ''
'Score', 'int(11)', 'YES', 'MUL', NULL, ''
'Text', 'varchar(4000)', 'YES', '', NULL, ''
'CreationDate', 'varchar(25)', 'YES', '', NULL, ''
'UserId', 'int(11)', 'YES', 'MUL', NULL, ''
'UserDisplayName', 'varchar(255)', 'YES', '', NULL, ''
SQL query:
SELECT aWord, COUNT(*) AS WordOccuranceCount
FROM (SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(concat(Text, ' '), ' ', aCnt), ' ', -1) AS aWord
FROM table
CROSS JOIN (
SELECT a.i+b.i*10+c.i*100 + 1 AS aCnt
FROM integers a, integers b, integers c) Sub1
WHERE (LENGTH(Body) + 1 - LENGTH(REPLACE(Text, ' ', ''))) >= aCnt) Sub2
WHERE Sub2.aWord != ''
GROUP BY aWord
ORDER BY WordOccuranceCount DESC
LIMIT 10
It lists out the top 10 words, but they are full of short words like a, the, you, me... etc.
How can I change it to skip certain words like those?
How can I make it so that say, only words 5 characters and up are counted?
Schema of integers table 'i', 'int(11)', 'NO', 'PRI', NULL, ''
Original post and code taken from this post. I am new and couldn't post anything on it so I had to ask here.
determining most used set of words php mysql
Thank you so much for your help!
Upvotes: 1
Views: 2698
Reputation: 2251
You should be able to just add another condition to your WHERE clause:
SELECT aWord, COUNT(*) AS WordOccuranceCount
FROM (SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(concat(Text, ' '), ' ', aCnt), ' ', -1) AS aWord
FROM table
CROSS JOIN (
SELECT a.i+b.i*10+c.i*100 + 1 AS aCnt
FROM integers a, integers b, integers c) Sub1
WHERE (LENGTH(Body) + 1 - LENGTH(REPLACE(Text, ' ', ''))) >= aCnt) Sub2
WHERE Sub2.aWord != '' AND
LENGTH(Sub2.aWord) >= 5
GROUP BY aWord
ORDER BY WordOccuranceCount DESC
LIMIT 10
Just checking to see if the length of aWord is at least 5 chars, and if so, include it in the result set. The LIMIT will be applied to the result set (post-filtering) and you should have what you need.
Upvotes: 1