Reputation: 2202
I have data that looks like this:
movie_id comment
1 tom cruise is great
1 great action movie
2 got teary eyed
2 great cast
1 tom cruise is hott
I'd like a function that returns the most common words in the comments, based on what movie_id I select. So if I'm querying movie_id=1, I'd get:
tom, 2
cruise, 2
is, 2
great, 2
hott, 1
action, 1
movie, 1
While if I query movie_id=2, I'd get:
got, 1
teary, 1
eyed, 1
great, 1
cast, 1
I saw some solutions using tsql, but I've never used that before and didn't understand the code. Looking for a way to do this in sqlite3.
Upvotes: 3
Views: 2159
Reputation: 1327
This a hack, but you can also use SQLite's Table-valued Functions to solve this:
select value from (
select
'["' || replace(trim(title), '"', '') || '"]' as arr
from story limit 2500000
) e, json_each(e.arr)
Watch out though - you might notice that you have a lot of characters in your column that may break JSON parsing and throw a JSON parsing error.
In that case, you might have to remove some control characters:
select value from (select
'["' || replace(replace(replace(replace(replace(replace(trim(title), '"', ''), CHAR(10), ""), CHAR(13), ''), " ", ""), "\", ""), ' ', '","')
|| '"]' as arr
from your_table limit 2500000
) e, json_each(e.arr)
To make it easier to debug, you can use a combo of limit and offset (eg. limit 1000000 offset 2200000
) to help you find the problem rows and then replace that character with a an empty string.
Upvotes: 0
Reputation: 1270713
You can do this with a really ugly query.
select word, count(*) from (
select (case when instr(substr(m.comments, nums.n+1), ' ') then substr(m.comments, nums.n+1)
else substr(m.comments, nums.n+1, instr(substr(m.comments, nums.n+1), ' ') - 1)
end) as word
from (select ' '||comments as comments
from m
)m cross join
(select 1 as n union all select 2 union all select 3
) nums
where substr(m.comments, nums.n, 1) = ' ' and substr(m.comments, nums.n, 1) <> ' '
) w
group by word
order by count(*) desc
This is untested. The inner query needs a list of numbers (limited to just 3 here; you can see how to add more). It then checks to see if a word starts at position n+1. A word starts after a space, so I put a space at the beginning of the comments.
Then it pulls the word out, for aggregation purposes.
Upvotes: 4