user3769475
user3769475

Reputation: 13

Counting most common word in SQL table with exclusions

I'm trying to count the most common words from a table full of text (strings) in a MySQL database (using MYSQL workbench). I got this code working from reading another post (written by Kickstart). This code uses a separate table called integer with 10 columns from 0 to 9 for counting.

Table Schema for the main table. I'm mainly only interested in data the "Text" column.

'Id', 'int(11)', 'NO', 'PRI', '0', ''
'PostId', 'int(11)', 'YES', 'MUL', NULL, ''
'Score', 'int(11)', 'YES', 'MUL', NULL, ''
'Text', 'varchar(4000)', 'YES', '', NULL, ''
'CreationDate', 'varchar(25)', 'YES', '', NULL, ''
'UserId', 'int(11)', 'YES', 'MUL', NULL, ''
'UserDisplayName', 'varchar(255)', 'YES', '', NULL, ''

SQL query:

SELECT aWord, COUNT(*) AS WordOccuranceCount
FROM (SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(concat(Text, ' '), ' ', aCnt), ' ', -1) AS aWord
FROM table
CROSS JOIN (
SELECT a.i+b.i*10+c.i*100 + 1 AS aCnt
FROM integers a, integers b, integers c) Sub1
WHERE (LENGTH(Body) + 1 - LENGTH(REPLACE(Text, ' ', ''))) >= aCnt) Sub2
WHERE Sub2.aWord != ''
GROUP BY aWord
ORDER BY WordOccuranceCount DESC
LIMIT 10

It lists out the top 10 words, but they are full of short words like a, the, you, me... etc. How can I change it to skip certain words like those?
How can I make it so that say, only words 5 characters and up are counted?

Schema of integers table 'i', 'int(11)', 'NO', 'PRI', NULL, ''

Original post and code taken from this post. I am new and couldn't post anything on it so I had to ask here.
determining most used set of words php mysql

Thank you so much for your help!

Upvotes: 1

Views: 2698

Answers (1)

VBlades
VBlades

Reputation: 2251

You should be able to just add another condition to your WHERE clause:

SELECT aWord, COUNT(*) AS WordOccuranceCount
FROM (SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(concat(Text, ' '), ' ', aCnt), ' ', -1) AS aWord
FROM table
CROSS JOIN (
SELECT a.i+b.i*10+c.i*100 + 1 AS aCnt
FROM integers a, integers b, integers c) Sub1
WHERE (LENGTH(Body) + 1 - LENGTH(REPLACE(Text, ' ', ''))) >= aCnt) Sub2
WHERE Sub2.aWord != '' AND
      LENGTH(Sub2.aWord) >= 5
GROUP BY aWord
ORDER BY WordOccuranceCount DESC
LIMIT 10

Just checking to see if the length of aWord is at least 5 chars, and if so, include it in the result set. The LIMIT will be applied to the result set (post-filtering) and you should have what you need.

Upvotes: 1

Related Questions