Reputation: 4032
I have a table where I am storing queries executed by users with some metadata. The table expires after 1 day.
CREATE TABLE '_query_trace' (
ts TIMESTAMP,
query_text VARCHAR,
execution_micros LONG,
principal VARCHAR
) timestamp(ts) PARTITION BY HOUR TTL 1 DAY BYPASS WAL
WITH maxUncommittedRows=500000, o3MaxLag=5000000us;
I have created a materialized view, to keep hourly statistics of execution per user and query
create materialized view query_stats AS (
select ts, principal,query_text, count() as executions, sum(execution_micros) as execution_micros, avg(execution_micros) as avg_execution_micros
FROM queries
SAMPLE BY 1h
) PARTITION BY DAY;
The problem I have with this view is that the query_text
column can be very large. About a third of the queries have over 4000 characters, so I was thinking what would be a good solution for this. I discarded using SYMBOL
as very often the queries are identical (querying by relative time-ranges like WHERE timestamp in today()
, but some dashboard tools, like grafana, interpolate automatically absolute time in the queries, which makes the values potentially very high cardinality.
Any ideas to work around this?
Upvotes: 0
Views: 14
Reputation: 4032
A potential solution is hashing the query text directly in the view, as in
create materialized view query_stats AS (
select ts, principal, md5(query_text) as query_hash, count() as executions, sum(execution_micros) as execution_micros, avg(execution_micros) as avg_execution_micros
FROM queries
SAMPLE BY 1h
) PARTITION BY DAY;
This way, all the queries will have always 32 characters. If we want to just keep track of executions relative to users, this could suffice. If we want to have the option to match with the real text of the query, we could create a second view:
create materialized view query_stats_text AS (
select ts, md5(query_text) as query_hash, query_text
FROM queries
SAMPLE BY 100y
) PARTITION BY YEAR;
Note that QuestDB forces materialized views to be sampled, so I am using SAMPLE BY 100y
, to force all the queries to have the epoch timestamp, so we don't have multiple entries per query for different sampling intervals.
Now it is possible to do
select qs.ts, qt.query_text, qs.principal, qs.executions from query_stats qs join query_stats_text qt ON (query_hash);
Upvotes: 0