Stevey
Stevey

Reputation: 3

SQLite query to get similar sentences

TABLE animals

id | txt
1  | cat dog parrot pig ant wolf pigeon robin
2  | fox mole dog weasel cat swan bee
3  | duck heron dog ant dog fly moth fox

I have a string, for example cat parrot fox bee mole dog weasel and I need a percent-based result, for example:

2 | 5/7 words found, 71.4%
1 | 3/7 words found, 42.9%
3 | 2/7 words found, 28,6%

My actual solution is to query

SELECT animals_id, animals_txt FROM animals
          WHERE (INSTR(animals_txt, 'cat') > 0 OR
                 INSTR(animals_txt, 'parrot') > 0 OR
                 INSTR(animals_txt, 'fox') > 0 OR
                 INSTR(animals_txt, 'bee') > 0 OR
                 INSTR(animals_txt, 'mole') > 0 OR
                 INSTR(animals_txt, 'dog') > 0 OR
                 INSTR(animals_txt, 'weasel') > 0)
          ORDER BY random()"

Then check that keyword-array {cat|parrot|fox|bee|mole|dog|weasel} against all results and make a percent-based result-array like:

id | percentage
55 | 97
87 | 89
19 | 49

Would love to use COUNT(), but not used to it. The database has about 90,000 rows. Is there a better solution?

Upvotes: 0

Views: 70

Answers (1)

MikeT
MikeT

Reputation: 56953

Wheher or not you consider it better is up to you but perhaps consider the following which gets the desired results in one hit. Note that more specific checking is done i.e. a space before or after the word is required so catfish is not consider to be a cat.

  • Note unless I need to revisit the optician it appears that 1 has 6 not 5 hits

Example :-

 WITH f(found,id,txt) AS (
    SELECT
        CASE WHEN instr(' '||txt||' ',' cat ') > 0 THEN 1 ELSE 0 END +
        CASE WHEN instr(' '||txt||' ',' parrot ') > 0 THEN 1 ELSE 0 END +
        CASE WHEN instr(' '||txt||' ',' fox ') > 0 THEN 1 ELSE 0 END +
        CASE WHEN instr(' '||txt||' ',' bee ') > 0 THEN 1 ELSE 0 END +
        CASE WHEN instr(' '||txt||' ',' mole ') > 0 THEN 1 ELSE 0 END +
        CASE WHEN instr(' '||txt||' ',' dog ') > 0 THEN 1 ELSE 0 END +
        CASE WHEN instr(' '||txt||' ',' weasel ')> 0 THEN 1 ELSE 0 END,
        id,
        txt
    FROM animals
    )
SELECT found||'/7 words found, '|| round(found / (7.0 / 100.00),1)||'%' AS result FROM f;
  • WITH allows you to create a CTE (Common Table Expression, a temporary table as such) or multiple CTE's. The CTE has been called f which is basically the result of the query that gets the existing rows and the found row which is the accumulation of hits (key words found in the sentence). f is then used in the main query to get the number of hits and the percentage of hits.

Result is :-

enter image description here

Additional

Instead of CASE WHEN THEN ELSE END as used above the following could have been used to get the number of hits (there are probably a number ways to achieve the same result) :-

    (instr(' '||txt||' ',' cat ') > 0) + 
    (instr(' '||txt||' ',' parrot ') > 0) + 
    (instr(' '||txt||' ',' fox ') > 0) + 
    (instr(' '||txt||' ',' bee ') > 0) +
    (instr(' '||txt||' ',' mole ') > 0) +
    (instr(' '||txt||' ',' dog ') > 0) +
    (instr(' '||txt||' ',' weasel ') > 0)

More Flexible Version

Here's an adaptation where the list of animals to check cat parrot fox bee mole dog weasel can be changed allowing a list of up to 20 animals. Note that it's important that the last animal be followed by a space otherwise the last will not be considered.

WITH RECURSIVE
    /* 1st CTE keywords extracted from input variable */ 
    k(counter,kword,rest) AS 
        (
            SELECT 
                0, /* start counter 0 because  first row will have no keyword */
                '', /* dummy first keyword (required as column counts must match for UNION) */
                'cat parrot fox bee mole dog weasel ' /* <<<<<<<<< list of animals to search for */
                -- '' /* used to test to items to search for therefore commented out */
            UNION ALL SELECT counter+1,substr(rest,1,instr(rest,' ')),substr(rest,instr(rest,' ') +1) FROM k LIMIT 21
        ),
    /* 2nd CTE remove the row where the keyword is empty */
    kc AS 
        (
            SELECT counter, kword FROM k WHERE length(kword) > 0
        ),
    /* 3rd CTE count the occurences (hits) of the keywords for each row in the animals table */
    f(found,id,txt) AS 
        (
            SELECT 
                CASE WHEN (SELECT count() FROM kc) >= 1 AND instr(' '||txt||' ',(SELECT kword FROM kc WHERE counter = 1)) > 1 THEN 1 ELSE 0 END +
                CASE WHEN (SELECT count() FROM kc) >= 2 AND instr(' '||txt||' ',(SELECT kword FROM kc WHERE counter = 2)) > 1 THEN 1 ELSE 0 END +
                CASE WHEN (SELECT count() FROM kc) >= 3 AND instr(' '||txt||' ',(SELECT kword FROM kc WHERE counter = 3)) > 1 THEN 1 ELSE 0 END +
                CASE WHEN (SELECT count() FROM kc) >= 4 AND instr(' '||txt||' ',(SELECT kword FROM kc WHERE counter = 4)) > 1 THEN 1 ELSE 0 END +
                CASE WHEN (SELECT count() FROM kc) >= 5 AND instr(' '||txt||' ',(SELECT kword FROM kc WHERE counter = 5)) > 1 THEN 1 ELSE 0 END +
                CASE WHEN (SELECT count() FROM kc) >= 6 AND instr(' '||txt||' ',(SELECT kword FROM kc WHERE counter = 6)) > 1 THEN 1 ELSE 0 END +
                CASE WHEN (SELECT count() FROM kc) >= 7 AND instr(' '||txt||' ',(SELECT kword FROM kc WHERE counter = 7)) > 1 THEN 1 ELSE 0 END +
                CASE WHEN (SELECT count() FROM kc) >= 8 AND instr(' '||txt||' ',(SELECT kword FROM kc WHERE counter = 8)) > 1 THEN 1 ELSE 0 END +
                CASE WHEN (SELECT count() FROM kc) >= 9 AND instr(' '||txt||' ',(SELECT kword FROM kc WHERE counter = 9)) > 1 THEN 1 ELSE 0 END +
                CASE WHEN (SELECT count() FROM kc) >= 10 AND instr(' '||txt||' ',(SELECT kword FROM kc WHERE counter = 10)) > 1 THEN 1 ELSE 0 END +
                CASE WHEN (SELECT count() FROM kc) >= 11 AND instr(' '||txt||' ',(SELECT kword FROM kc WHERE counter = 11)) > 1 THEN 1 ELSE 0 END +
                CASE WHEN (SELECT count() FROM kc) >= 12 AND instr(' '||txt||' ',(SELECT kword FROM kc WHERE counter = 12)) > 1 THEN 1 ELSE 0 END +
                CASE WHEN (SELECT count() FROM kc) >= 13 AND instr(' '||txt||' ',(SELECT kword FROM kc WHERE counter = 13)) > 1 THEN 1 ELSE 0 END +
                CASE WHEN (SELECT count() FROM kc) >= 14 AND instr(' '||txt||' ',(SELECT kword FROM kc WHERE counter = 14)) > 1 THEN 1 ELSE 0 END +
                CASE WHEN (SELECT count() FROM kc) >= 15 AND instr(' '||txt||' ',(SELECT kword FROM kc WHERE counter = 15)) > 1 THEN 1 ELSE 0 END +
                CASE WHEN (SELECT count() FROM kc) >= 16 AND instr(' '||txt||' ',(SELECT kword FROM kc WHERE counter = 16)) > 1 THEN 1 ELSE 0 END +
                CASE WHEN (SELECT count() FROM kc) >= 17 AND instr(' '||txt||' ',(SELECT kword FROM kc WHERE counter = 17)) > 1 THEN 1 ELSE 0 END +
                CASE WHEN (SELECT count() FROM kc) >= 18 AND instr(' '||txt||' ',(SELECT kword FROM kc WHERE counter = 18)) > 1 THEN 1 ELSE 0 END +
                CASE WHEN (SELECT count() FROM kc) >= 19 AND instr(' '||txt||' ',(SELECT kword FROM kc WHERE counter = 19)) > 1 THEN 1 ELSE 0 END +
                CASE WHEN (SELECT count() FROM kc) >= 20 AND instr(' '||txt||' ',(SELECT kword FROM kc WHERE counter = 20)) > 1 THEN 1 ELSE 0 END
                ,id,txt
            FROM animals
        )
/* Output the results as per hits from the final CTE */
SELECT 
    *, 
    ifnull(
        found||' out of '
            ||(SELECT count() FROM kc)
            ||' words found, '
            || round(found / (CAST((SELECT count() FROM kc) AS REAL) / 100.00),1) 
            ||'%',

        /* handle no animals to search for */
        'No Animals to search for!'
    )
    AS result 
FROM f 
ORDER BY random()
;

Results

Note that 3 extra rows have been added to the animals table one is an empty string '', the second is NULL, the third is a list of animals none of which are being looked for as per 'kcud noreh god tna god ylf htom xof'.

Animal list to search for is as per the question (added space) i.e. 'cat parrot fox bee mole dog weasel ', this results in :-

enter image description here

If the list to search for is changed say to 'cat parrot fox bee mole dog weasel snake leopard elephant moose flea unicorn gnat butterfly flutterby gutterfly horse hare rabbit giraffe ' /* <<<<<<<<< list of animals to search for */ 21 animals then the result is :-

enter image description here

  • Only 20 animals have been searched for due to a) the LIMIT 21 (the very first row will be an empty keyword so it will be dropped by CTE2) and then b) the CASE statements only cater for 20 keywords.

Upvotes: 1

Related Questions