Reputation: 5766

determining most used set of words php mysql

I'm trying to figure out how to go about determining the most used words on a mysql dataset.

Not sure how to go about this or if there's a simpler approach. Read a couple posts where some suggests an algorithm.

Example:

From 24,500 records, find out the top 10 used words.

Upvotes: 7

Answers (5)

Mahdi Malekian

Reputation: 93

SELECT `COLUMNNAME`, COUNT(*) FROM `TABLENAME` GROUP BY `COLUMNNAME`

its very simple and worked... :)

Upvotes: 1

Eduardo de Souza

Reputation: 181

A little improve, remove stop words from the list with AND Sub2.aWord not in (list of stop words)

SELECT aWord, COUNT(*) AS WordOccuranceCount
FROM (SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(concat(txt_msg, ' '), ' ', aCnt), ' ', -1) AS aWord
FROM mensagens
CROSS JOIN (
SELECT a.i+b.i*10+c.i*100 + 1 AS aCnt
FROM integers a, integers b, integers c) Sub1
WHERE (LENGTH(txt_msg) + 1 - LENGTH(REPLACE(txt_msg, ' ', ''))) >= aCnt) Sub2
WHERE Sub2.aWord != '' AND Sub2.aWord not in ('a','about','above', .....)
GROUP BY aWord
ORDER BY WordOccuranceCount DESC
LIMIT 10

Upvotes: 0

Ege Akpinar

Reputation: 3286

Why not do it all in PHP? Steps would be

Create a dictionary (word => count)
Read you data in PHP
Split it into words
Add each word to the dictionary (you might want to lowercase and trim them first)
If already in the dictionary, increment its count. If not already in the dictionary, set 1 as its value (count = 1)
Iterate your dictionary elements to find the highest 10 values

I wouldn't do it in SQL mainly because it'd end up more complex.

Upvotes: 4

fenway

Reputation: 446

General idea would be to figure out how many delimiters (e.g. spaces) are in each field, and run SUBSTRING_INDEX() in a loop, for each such field. Populating this into a temporary table has the added benefit of being able to run this in chunks, in parallel, etc. Shouldn't be too cumbersome to throw some SPs together to do this.

Upvotes: 1

Kickstart

Reputation: 21513

Right, this runs like a dog and is limited to working with a single delimiter, but hopefully will give you an idea.

SELECT aWord, COUNT(*) AS WordOccuranceCount
FROM (SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(concat(SomeColumn, ' '), ' ', aCnt), ' ', -1) AS aWord
FROM SomeTable
CROSS JOIN (
SELECT a.i+b.i*10+c.i*100 + 1 AS aCnt
FROM integers a, integers b, integers c) Sub1
WHERE (LENGTH(SomeColumn) + 1 - LENGTH(REPLACE(SomeColumn, ' ', ''))) >= aCnt) Sub2
WHERE Sub2.aWord != ''
GROUP BY aWord
ORDER BY WordOccuranceCount DESC
LIMIT 10

This relies on having a table called integers with a single column called i with 10 rows with the values 0 to 9. It copes with up to ~1000 words but can easily be altered to cope with more (but will slow down even more).

Upvotes: 14

determining most used set of words php mysql

Answers (5)

Related Questions