Reputation: 5168
I am attempting to use a dialect of SQL to parse a set of records. Specifically I want to count the number records that contain various ids (words) occur in my logs. I also want to count the total number of times these ids are present.
The data is present in two different tables:
id
,status
.id
might be present more than once in a given record.See SQL Fiddle!
This table contains the ids and their current status.
id, status
sessions, ACTIVE
visits, DEPRECATED
duration, ACTIVE
...
Sentence record may contain over 300k records:
request
example.com/api?foo=sessions
example.com/api?bar=session%2Cvisits,foo=sessions
example.com/api?bar=duration,visits
example.com/api?foo=sessions
example.com/api?foo=visits,bar=visits
...
I want to create a table containing four columns id,status,occurence_count,record_count
.
occurence_count
: should be the total number of times an id occurs in all the records.recourd_count
: should be the the total number of records a particular id
occurs in.As a result from the above example I would generate the following table:
id, status, occurence_count, recourd_count
sessions,ACTIVE,3,2
visits,DEPRECATED,4,3
duration,ACTIVE,1,1
Is there a way to construct an SQL query which will return the desired table? I would know how to do it fairly readily in Python or any other language but if possible I'd like to write a SELECT
statement to generate this table.
Update: As suggested below, I have attempted to execute the following:
SELECT ids.id, ids.status,
SUM(length(request) - LENGTH(replace(request, ids.id, ''))) / LENGTH(ids.id) as occurence_count,
COUNT(reqs.request) AS recourd_count
FROM id_status ids LEFT JOIN
request_records reqs
ON find_in_set(ids.id, reqs.requests) > 0
GROUP BY ids.id, ids.status;
However my dialect of SQL does not permit this type of join and returns the following error:
ON clause must be AND of = comparisons of one field name from each table, with all field names prefixed with table name.
Added SQL Fiddle with Sample Schema:
CREATE TABLE id_status
(`id` varchar(32), `status` varchar(32))
;
INSERT INTO id_status
(`id`, `status`)
VALUES
('sessions', 'ACTIVE'),
('visits', 'DEPRECATED'),
('duration', 'ACTIVE')
;
CREATE TABLE request_records
(`request` varchar(500))
;
Upvotes: 1
Views: 885
Reputation: 39444
Simple version:
SELECT id,
status,
(SELECT (SUM(CHAR_LENGTH(request)) -
SUM(CHAR_LENGTH(REPLACE(request, id, ''))))
/ CHAR_LENGTH(id) FROM request_records) AS occurrence_count,
(SELECT COUNT(*)
FROM request_records
WHERE INSTR(request, id) > 0) AS record_count
FROM id_status
See updated SQL fiddle.
This just looks for the word being present anywhere so for example would blindly count abcsessions
or sessionsxyz
when looking for sessions
. If something more complex is needed (e.g. considering word boundaries), you might need to use a regular expression replacer rather than the REPLACE
function - but unfortunately this is one thing MySQL lacks. I once had a go at writing one but it's not trivial - see here.
Upvotes: 1
Reputation: 1270623
You can generate what you want with a query. It will not be efficient:
select w.word, w.status,
sum(length(sentence) - length(replace(sentence, w.word, ''))) / length(w.word) as cnt,
count(s.sentence) as num_occurrences
from words w left join
sentences s
on find_in_set(w.word, s.sentence) > 0
group by w.word, w.status;
Upvotes: 3