Matt
Matt

Reputation: 5168

SQL count the occurrences of words from one table in second table

I am attempting to use a dialect of SQL to parse a set of records. Specifically I want to count the number records that contain various ids (words) occur in my logs. I also want to count the total number of times these ids are present.

The data is present in two different tables:

See SQL Fiddle!

id_status

This table contains the ids and their current status.

id, status
sessions, ACTIVE
visits, DEPRECATED
duration, ACTIVE
...

request_records

Sentence record may contain over 300k records:

request
example.com/api?foo=sessions
example.com/api?bar=session%2Cvisits,foo=sessions
example.com/api?bar=duration,visits
example.com/api?foo=sessions
example.com/api?foo=visits,bar=visits
...

Goal

I want to create a table containing four columns id,status,occurence_count,record_count.

As a result from the above example I would generate the following table:

id, status, occurence_count, recourd_count
sessions,ACTIVE,3,2
visits,DEPRECATED,4,3
duration,ACTIVE,1,1

SQL

Is there a way to construct an SQL query which will return the desired table? I would know how to do it fairly readily in Python or any other language but if possible I'd like to write a SELECT statement to generate this table.

Update: As suggested below, I have attempted to execute the following:

SELECT ids.id, ids.status,
     SUM(length(request) - LENGTH(replace(request, ids.id, ''))) / LENGTH(ids.id) as occurence_count,
     COUNT(reqs.request) AS recourd_count
FROM id_status ids LEFT JOIN
     request_records reqs
     ON find_in_set(ids.id, reqs.requests) > 0
GROUP BY ids.id, ids.status;

However my dialect of SQL does not permit this type of join and returns the following error:

ON clause must be AND of = comparisons of one field name from each table, with all field names prefixed with table name.

Added SQL Fiddle with Sample Schema:

CREATE TABLE id_status
    (`id` varchar(32), `status` varchar(32))
;

INSERT INTO id_status
    (`id`, `status`)
VALUES
    ('sessions', 'ACTIVE'),
    ('visits', 'DEPRECATED'),
    ('duration', 'ACTIVE')
;

CREATE TABLE request_records 
    (`request` varchar(500))
;

Upvotes: 1

Views: 885

Answers (2)

Steve Chambers
Steve Chambers

Reputation: 39444

Simple version:

SELECT id,
       status,
       (SELECT (SUM(CHAR_LENGTH(request)) -
                SUM(CHAR_LENGTH(REPLACE(request, id, ''))))
               / CHAR_LENGTH(id) FROM request_records) AS occurrence_count,
       (SELECT COUNT(*)
        FROM request_records
        WHERE INSTR(request, id) > 0) AS record_count
FROM id_status

See updated SQL fiddle.

This just looks for the word being present anywhere so for example would blindly count abcsessions or sessionsxyz when looking for sessions. If something more complex is needed (e.g. considering word boundaries), you might need to use a regular expression replacer rather than the REPLACE function - but unfortunately this is one thing MySQL lacks. I once had a go at writing one but it's not trivial - see here.

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1270623

You can generate what you want with a query. It will not be efficient:

select w.word, w.status,
       sum(length(sentence) - length(replace(sentence, w.word, ''))) / length(w.word) as cnt,
       count(s.sentence) as num_occurrences
from words w left join
     sentences s
     on find_in_set(w.word, s.sentence) > 0
group by w.word, w.status;

Upvotes: 3

Related Questions