egor7
egor7

Reputation: 4940

How does kdb+ decide when to use attribute metadata in queries

Why does in some cases kdb+ decide to use attribute metadata (index), and in some cases doesn't:

n:floor 10 xexp 7;
syms:`AAPL`GOOG`IBM`MSFT`ORCL;
trade:([]time:n?.z.t;sym:n?syms;price:n?1000f);
`:2021.01.01/trade/ set .Q.en[`:.;trade];
`:2021.01.01/tradeAttr/ set .Q.en[`:.;trade];
@[`sym xasc `:2021.01.01/tradeAttr;`sym;`p#];
\l .

\t:10 select count i by sym from trade where date=2021.01.01
1074
\t:10 select count i by sym from trade where date=2021.01.01,sym in syms
2227
\t:10 select count i by sym from tradeAttr where date=2021.01.01
443
\t:10 select count i by sym from tradeAttr where date=2021.01.01,sym in syms
1

Could you explain these numbers please?

It looks like kdb+ uses direct access to attributes metadata in the last case. But why the same users intention does not detected into the 3rd query?

Upvotes: 1

Views: 214

Answers (1)

terrylynch
terrylynch

Reputation: 13612

Summarizing our comments as an actual answer:

In general it's best not to use the "virtual" columns date and i when getting counts in a partitioned table. By using "i" in a given date slice, kdb assumes that you want to know the actual total running index "i" from the beginning of time for that table. So if date slice 1 has 100k rows, in that date slice "i" ranges from 0 to 99,999. If date slice 2 has 50k rows, in that date slice the overall "i" ranges from 100,000 to 149,999 and so on. But in order to know that "i" ranges from 100,000 to 149,999 in the second slice, kdb would need to know the count of the first date slice. So it needs to also count the first date slice. In a large database, this means that it needs to count all date slices up to the date slice you're trying to get "i" for.

As egor pointed out, kdb populates this count-of-date-slices in the .Q.pn dictionary.

In this example, since you didn't actually care what the values of "i" are - you just cared about how many there are in that one date - you don't need to have kdb count all date slices. By counting any other column from the table, or by using a trick such as count 1b (which automatically fills out the 1b atom to the required length of the table), you can achieve counts within a date slice without forcing a count of all date slices.

As for the final example where kdb didn't do this after a filter on a column with an attribute: it seems like kdb prioritises that filter first before determining the count of "i"......but after that filtering the "i" is already a subset of the full days "i" so it knows that it doesn't need to build all "i"s for all date slices to get the "i"s for the current date slice. It becomes a new on-the-fly "i" at that point, relevant only to the subset table you've generated from the select.

Final note

If you did actually care about the overall running index "i" (e.g. you want to see the 1 millionth record of all time in this table since its beginning), there is a tool for that .Q.ind, see https://code.kx.com/q/ref/dotq/#qind-partitioned-index

Upvotes: 2

Related Questions