User
User

Reputation: 24731

How to find most popular word occurrences in MySQL?

I have a table called results with 5 columns.

I'd like to use the title column to find rows that are say: WHERE title like '%for sale%' and then listing the most popular words in that column. One would be for and another would be sale but I want to see what other words correlate with this.

Sample data:

title
cheap cars for sale
house for sale
cats and dogs for sale
iphones and androids for sale
cheap phones for sale
house furniture for sale

Results (single words):

for    6
sale    6
cheap    2
and    2
house    2
furniture 1
cars    1
etc...

Upvotes: 15

Views: 4635

Answers (6)

Grzegorz Adam Kowalski
Grzegorz Adam Kowalski

Reputation: 5565

You can use ExtractValue in some interesting way. See SQL fiddle here: http://sqlfiddle.com/#!9/0b0a0/45

We need only one table:

CREATE TABLE text (`title` varchar(29));

INSERT INTO text (`title`)
VALUES
    ('cheap cars for sale'),
    ('house for sale'),
    ('cats and dogs for sale'),
    ('iphones and androids for sale'),
    ('cheap phones for sale'),
    ('house furniture for sale')
;

Now we construct series of selects which extract whole words from text converted to XML. Each select extracts N-th word from the text.

select words.word, count(*) as `count` from
(select ExtractValue(CONCAT('<w>', REPLACE(title, ' ', '</w><w>'), '</w>'), '//w[1]') as word from `text`
union all
select ExtractValue(CONCAT('<w>', REPLACE(title, ' ', '</w><w>'), '</w>'), '//w[2]') from `text`
union all
select ExtractValue(CONCAT('<w>', REPLACE(title, ' ', '</w><w>'), '</w>'), '//w[3]') from `text`
union all
select ExtractValue(CONCAT('<w>', REPLACE(title, ' ', '</w><w>'), '</w>'), '//w[4]') from `text`
union all
select ExtractValue(CONCAT('<w>', REPLACE(title, ' ', '</w><w>'), '</w>'), '//w[5]') from `text`) as words
where length(words.word) > 0
group by words.word
order by `count` desc, words.word asc

Upvotes: 7

Gordon Linoff
Gordon Linoff

Reputation: 1269443

You can extract words with some string manipulation. Assuming you have a numbers table and that words are separated by single spaces:

select substring_index(substring_index(r.title, ' ', n.n), ' ', -1) as word,
       count(*)
from results r join
     numbers n
     on n.n <= length(title) - length(replace(title, ' ', '')) + 1
group by word;

If you don't have a numbers table, you can construct one manually using a subquery:

from results r join
     (select 1 as n union all select 2 union all select 3 union all . . .
     ) n
     . . .

The SQL Fiddle (courtesy of @GrzegorzAdamKowalski) is here.

Upvotes: 9

Grzegorz Adam Kowalski
Grzegorz Adam Kowalski

Reputation: 5565

Here is working SQL Fiddle: http://sqlfiddle.com/#!9/0b0a0/32

Let's start with two tables - one for texts and one for numbers:

CREATE TABLE text (`title` varchar(29));

INSERT INTO text
    (`title`)
VALUES
    ('cheap cars for sale'),
    ('house for sale'),
    ('cats and dogs for sale'),
    ('iphones and androids for sale'),
    ('cheap phones for sale'),
    ('house furniture for sale')
;

CREATE TABLE iterator (`index` int);

INSERT INTO iterator
    (`index`)
VALUES
    (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),
    (16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30)
;

The second table, iterator must contains numbers from 1 to N where N higher or equal to the lenght of the longest string in text.

Then, run this query:

select
  words.word, count(*) as `count`
from 
(select
  substring(concat(' ', t.title, ' '), i.index+1, j.index-i.index) as word
from
  text as t, iterator as i, iterator as j
where
    substring(concat(' ', t.title), i.index, 1) = ' '
and substring(concat(t.title, ' '), j.index, 1) = ' '
and i.index < j.index
) AS words
where
    length(words.word) > 0
and words.word not like '% %'
group by words.word
order by `count` desc, words.word asc

There are two selects. Outer one simply groups and counts single words (words of length greater than 0 and without any spaces). Inner one extracts all strings starting from any space character and ending with any other space character, so strings aren't words (despite naming this subquery words) because they can contain other spaces than starting and ending one.

Results:

word    count
for     6
sale    6
and     2
cheap   2
house   2
androids    1
cars    1
cats    1
dogs    1
furniture   1
iphones     1
phones  1

Upvotes: 0

exussum
exussum

Reputation: 18550

SQL is not well suited for this task, While possible there are limitations (the number of words for example)

a quick PHP script to do the same task may be easier to use long term (and likely quicker too)

<?php
$rows = [
    "cheap cars for sale",
    "house for sale",
    "cats and dogs for sale",
    "iphones and androids for sale",
    "cheap phones for sale",
    "house furniture for sale",
];

//rows here should be replaced by the SQL result
$wordTotals = [];
foreach ($rows as $row) {
   $words = explode(" ", $row);
    foreach ($words as $word) {
        if (isset($wordTotals[$word])) {
            $wordTotals[$word]++; 
            continue;
        }

        $wordTotals[$word] = 1;
    }
}

arsort($wordTotals);

foreach($wordTotals as $word => $count) {
    echo $word . " " . $count . PHP_EOL;
}

Output

for 6
sale 6
and 2
cheap 2
house 2
phones 1
androids 1
furniture 1
cats 1
cars 1
dogs 1
iphones 1

Upvotes: 0

Hotdin Gurning
Hotdin Gurning

Reputation: 1819

This would give you single words (Just if I understand what your single word means.):

select concat(val,' ',cnt) as result from(
    select (substring_index(substring_index(t.title, ' ', n.n), ' ', -1)) val,count(*) as cnt
        from result t cross join(
         select a.n + b.n * 10 + 1 n
         from 
                (select 0 as n union all select 1 union all select 2 union all select 3 
                        union all select 4 union all select 5 union all select 6 
                        union all select 7 union all select 8 union all select 9) a,
                (select 0 as n union all select 1 union all select 2 union all select 3 
                        union all select 4 union all select 5 union all select 6 
                        union all select 7 union all select 8 union all select 9) b
                order by n 
        ) n
    where n.n <= 1 + (length(t.title) - length(replace(t.title, ' ', '')))
    group by val
    order by cnt desc
) as x

Result should be looks like this :

Result
--------
for 6
sale 6
house 2
and 2
cheap 2
phones 1
iphones 1
dogs 1
furniture 1
cars 1
androids 1
cats 1

But if the single word you need like this :

result
-----------
for 6 sale 6 house 2 and 2 cheap 2 phones 1 iphones 1 dogs 1 furniture 1 cars 1 androids 1 cats 1

Just modify the query above to:

select group_concat(concat(val,' ',cnt) separator ' ') as result from( ...

Upvotes: 3

J&#252;rgen Steinblock
J&#252;rgen Steinblock

Reputation: 31723

Update

Idea taken from https://stackoverflow.com/a/17942691/98491

This query works on my machine (MySQL 5.7), however Sqlfiddle reports an error. The basic idea is that you should either create a table with numbers from 1 to maximum word occurence (like 4) in your field or as I did, use a UNION 1 .. 4 for simplicity.

CREATE TABLE products (
  `id` int,
  `name` varchar(45)
);

INSERT INTO products
    (`id`, `name`)
VALUES
    (1, 'for sale'),
    (2, 'for me'),
    (3, 'for you'),
    (4, 'you and me')
;

SELECT name, COUNT(*) as count FROM
(
SELECT
  product.id,
  SUBSTRING_INDEX(SUBSTRING_INDEX(product.name, ' ', numbers.n), ' ', -1) name
FROM
  (
    SELECT 1 AS n
    UNION SELECT 2
    UNION SELECT 3
    UNION SELECT 4
  ) AS numbers
  INNER JOIN products product
  ON CHAR_LENGTH(product.name)
     -CHAR_LENGTH(REPLACE(product.name, ' ', ''))>=numbers.n-1
ORDER BY
  id, n
)
AS result
GROUP BY name
ORDER BY count DESC

Result will be

for | 3
you | 2
me  | 2
and | 1
sale| 1

Upvotes: 0

Related Questions