Alternative 1 of index data entry

One of the three alternatives of what to store as a data entry in an index is a data entry k* which is an actual record with search key value k. My question is, if you're going to store the actual record in the index, then what's the point of creating the index?

Upvotes: 2

Answers (1)

logi-kal

Reputation: 7880

This is an extreme case, because it does not really correspond to having data entries separated by data records (the hashed file is an example of this case).

(M. Lenzerini, R. Rosati, Database Management Systems: Access file manager and query evaluation, "La Sapienza" University of Rome, 2016)

Alternative 1 is often used for direct indexing, for instance in B-trees and hash indexes (see also Oracle, Building Domain Indexes)

Let's do a concrete example.

We have a relation R(a,b,c) and we have a clustered B⁺⁠-⁠tree using alternative 2 on search key a. Since the tree is clustered, the relation R must be sorted by a.

Now, let's suppose that a common query for the relation is:

SELECT *
FROM R
WHERE b > 25

so we want to build another index to efficiently support this kind of query.

Case 1: clustered tree with alt. 2

We know that clustered B⁺⁠-⁠trees with alternative 2 are efficient with range queries, because they needs just to search for the first good result (say the one with b=25), then do 1 page access to the relation's page to which this result points, and finally scan that page (and eventually, some other pages) until the records fall within the given range.

To sum up:

Search for the first good result in the tree. Cost: log_ƒ(ℓ)
Use the found pointer to go to a specific page. Cost: 1
Scan the page and eventual other pages. Cost: num. of relevant pages

The final cost (expressed in terms of page accesses) is

log_ƒ(ℓ) + 1 + #relevant-pages

where ƒ is the fan-out and ℓ the number of leaves.

Unfortunately, in our case a tree on search key b must be unclustered, because the relation is already sorted by a

Case 2: unclustered tree with alt. 2 (or 3)

We also know that B⁺⁠-⁠trees are not so efficient in range queries when they are unclustered. Infact, having a tree with alternative 2 or 3, in the tree we'd store only the pointers to the records, so for each result that falls in the range we'd have to do a page access to a potential different page (because the relation has a different order with respect to the index).

To sum up:

Search for the first good result in the tree. Cost: log_ƒ(ℓ)
Follow scanning the leaf (and maybe other leaves) and do a different page access for each tuple that falls in the range. Cost: num. of other relevant leaves + num. of relevant tuples

The final cost (expressed in terms of page accesses) is

log_ƒ(ℓ) + #other-relevant-leaves + #relevant-tuples

notice that the number of tuples is pretty bigger respect to the number of pages!

Case 3: unclustered tree with alt. 1

Using alternative 1, we have all the data in the tree, so for executing the query we:

Search for the first good result in the tree. Cost: log_ƒ(ℓ)
Follow scanning the leaf (and maybe other leaves). Cost: num. of other relevant leaves

The final cost (expressed in terms of page accesses) is

log_ƒ(ℓ) + #other-relevant-leaves

that is even smaller than (or at most equal to) the cost of case 1, but this instead is allowed.

I hope I was clear enough.

N.B. The cost is expressed in terms of page accesses because the I/O operations from/to second⁠-⁠storage are the most expensive ones in terms of time (we ignore the cost of scanning a whole page in main memory but we consider just the cost of accessing it).

Upvotes: 2

Alternative 1 of index data entry

Answers (1)

Case 1: clustered tree with alt. 2

Case 2: unclustered tree with alt. 2 (or 3)

Case 3: unclustered tree with alt. 1

Related Questions