Kjetil Hamre
Kjetil Hamre

Reputation: 96

How optimize for both duplicates and summary queries on adx

We are ingesting simple device readings on regular intervals into adx. However, 0.28% of the ingested readings may be later updated with new reading values. This can be rectified with a simple Materialized View and arg_max. However, we also want to optimize summaries of readings per day, month and year. The problem being, I cannot make a materialized view on top of a materialized view. It is also not possible to make a materialized view with two summary clauses.

Example:

.create table Readings (Timestamp:datetime, DeviceName:string, IngestTime:datetime, Reading:decimal)

.ingest inline into table Readings <|
    "2022-10-31 23:00:00.0000000", "EX", "2022-11-06 11:02:29.5690000",0.733 
    "2022-10-31 22:00:00.0000000", "EX", "2022-11-05 11:05:36.4230000",0.763
    "2022-10-31 22:00:00.0000000", "EX", "2022-11-01 07:08:55.9580000",0.5
    "2022-10-31 22:00:00.0000000", "EX", "2022-11-02 11:04:42.7050000",0.5
    "2022-10-31 21:00:00.0000000", "EX", "2022-11-05 11:05:36.4230000",0.856
    "2022-10-31 20:00:00.0000000", "EX", "2022-11-05 11:05:36.4230000",0.827
    "2022-10-31 20:00:00.0000000", "EX", "2022-11-02 11:04:42.7050000",0
    "2022-10-31 20:00:00.0000000", "EX", "2022-11-01 07:08:55.9580000",0
    "2022-10-31 19:00:00.0000000", "EX", "2022-11-05 11:05:36.4230000",0.935

.create materialized-view with (backfill=true, DocString="Only latest measures by arg_max", 
        effectiveDateTime=datetime(2019-01-01), 
        MaxSourceRecordsForSingleIngest=10000000, 
        Concurrency=5 
) ReadingsLatest on table Readings { Readings 
| summarize arg_max(IngestTime, *) by DeviceName, Reading } 

gives output

2022-10-31T23:00:00Z     "EX"  2022-11-06T11:02:29.569Z   0.733
2022-10-31T22:00:00Z    "EX"  2022-11-05T11:05:36.423Z   0.763
2022-10-31T21:00:00Z    "EX"  2022-11-05T11:05:36.423Z   0.856
2022-10-31T20:00:00Z    "EX"  2022-11-05T11:05:36.423Z   0.827
2022-10-31T19:00:00Z    "EX"  2022-11-05T11:05:36.423Z   0.935

The problem is performance when we perform aggregate queries:

ReadingsLatest
| summarize Reading_Day = sum(Reading) by Day = startofday(datetime_utc_to_local(Timestamp, 'Europe/Oslo')), DeviceName

So we would like a Materialized view on that, but we cannot figure out how:

.create materialized-view with (backfill=true, DocString="Only latest measures by arg_max", 
        effectiveDateTime=datetime(2019-01-01), 
        MaxSourceRecordsForSingleIngest=10000000, 
        Concurrency=5 
) ReadingsLatestDay on materialized-view ReadingsLatest { ReadingsLatest 
| summarize Reading_Day = sum(Reading) by Day = startofday(datetime_utc_to_local(Timestamp, 'Europe/Oslo')), DeviceName }

//fails with:
Cannot create materialized view 'ReadingsLatestDay': Materialized view can only be created on top of another materialized view which includes a single any()/anyif()/take_any()/take_anyif() aggregation.

Trying both at the same time:

.create materialized-view with (backfill=true, DocString="Only latest measures by arg_max", 
        effectiveDateTime=datetime(2019-01-01), 
        MaxSourceRecordsForSingleIngest=10000000, 
        Concurrency=5 
) ReadingsLatestDay on table Readings { Readings 
| summarize arg_max(IngestTime, *) by DeviceName, Reading  
| summarize Reading_Day = sum(Reading) by Day = startofday(datetime_utc_to_local(Timestamp, 'Europe/Oslo')), DeviceName }

//fails with:
Cannot create materialized view 'ReadingsLatestDay': Materialized views query can only include a single summarize operator over the source table.

Another option we have considered is deleting the Readings that have been updated in a later ingest, but this also seems to be difficult. We cannot figure out the syntax.

Upvotes: 3

Views: 286

Answers (1)

yifats
yifats

Reputation: 2744

The errors you are getting are expected - materialized view over materialized view is only supported when the first view is of type take_any(*) - see docs here. More than a single aggregation in the same materialized view also isn't supported. There is no build in way to pre-compute both of these aggregates in ADX. You can create the first materialized view (arg_max()) and then orchestrate your own pipeline, using any of the orchestration tools, that periodically queries the materialized view and persists the daily aggregates, using ingest from query commands.

Upvotes: 2

Related Questions