Reputation:
One of the three alternatives of what to store as a data entry in an index is a data entry k* which is an actual record with search key value k. My question is, if you're going to store the actual record in the index, then what's the point of creating the index?
Upvotes: 2
Views: 4723
Reputation: 7880
This is an extreme case, because it does not really correspond to having data entries separated by data records (the hashed file is an example of this case).
(M. Lenzerini, R. Rosati, Database Management Systems: Access file manager and query evaluation, "La Sapienza" University of Rome, 2016)
Alternative 1 is often used for direct indexing, for instance in B-trees and hash indexes (see also Oracle, Building Domain Indexes)
Let's do a concrete example.
We have a relation R(a,b,c)
and we have a clustered B+-tree using alternative 2 on search key a
. Since the tree is clustered, the relation R
must be sorted by a
.
Now, let's suppose that a common query for the relation is:
SELECT *
FROM R
WHERE b > 25
so we want to build another index to efficiently support this kind of query.
We know that clustered B+-trees with alternative 2 are efficient with range queries, because they needs just to search for the first good result (say the one with b=25
), then do 1 page access to the relation's page to which this result points, and finally scan that page (and eventually, some other pages) until the records fall within the given range.
To sum up:
The final cost (expressed in terms of page accesses) is
logƒ(ℓ) + 1 + #relevant-pages
where ƒ
is the fan-out and ℓ
the number of leaves.
Unfortunately, in our case a tree on search key b
must be unclustered, because the relation is already sorted by a
We also know that B+-trees are not so efficient in range queries when they are unclustered. Infact, having a tree with alternative 2 or 3, in the tree we'd store only the pointers to the records, so for each result that falls in the range we'd have to do a page access to a potential different page (because the relation has a different order with respect to the index).
To sum up:
The final cost (expressed in terms of page accesses) is
logƒ(ℓ) + #other-relevant-leaves + #relevant-tuples
notice that the number of tuples is pretty bigger respect to the number of pages!
Using alternative 1, we have all the data in the tree, so for executing the query we:
The final cost (expressed in terms of page accesses) is
logƒ(ℓ) + #other-relevant-leaves
that is even smaller than (or at most equal to) the cost of case 1, but this instead is allowed.
I hope I was clear enough.
N.B. The cost is expressed in terms of page accesses because the I/O operations from/to second-storage are the most expensive ones in terms of time (we ignore the cost of scanning a whole page in main memory but we consider just the cost of accessing it).
Upvotes: 2