Reputation: 211
The author of the Bitmap Scan feature described the difference between a Bitmap Heap Scan and an Index Scan:
A plain indexscan fetches one tuple-pointer at a time from the index, and immediately visits that tuple in the table. A bitmap scan fetches all the tuple-pointers from the index in one go, sorts them using an in-memory "bitmap" data structure, and then visits the table tuples in physical tuple-location order. The bitmap scan improves locality of reference to the table at the cost of more bookkeeping overhead to manage the "bitmap" data structure --- and at the cost that the data is no longer retrieved in index order, which doesn't matter for your query but would matter if you said ORDER BY.
Questions:
Why does it sort the fetched tuple-pointers again, when the index is already sorted?
How does it sort with bitmap? I know what bitmap is, but I don't understand how it can be used for sorting.
Why is it faster than an Index Scan when fetching a moderately large percentage of the table? If anything, it seems to add quite a bit of computation to the process.
Upvotes: 8
Views: 4635
Reputation: 6726
A bitmap is a data structure that Postgres uses to temporarily store the row pointers it has found during an index scan.
If Postgres doesn't use a bitmap, then for every row pointer it finds in the index, it must immediately fetch the corresponding row from the table. This can be slow if the rows are spread across different data pages*. With a bitmap, Postgres stores the row pointers in memory grouped by data page, and can then fetch the rows faster, page by page. The downside is that it needs additional memory to maintain the bitmap. In fact, if the index scan returns too many row pointers, building the exact bitmap might be impractical (too much space). In those situations Postgres can use a "lossy bitmap", which keeps track only of the data pages and not of the specific row pointers. When fetching the actual rows with a lossy bitmap, Postgres has to read the entire data pages and recheck the conditions to find the matching rows.
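The grouping described above can be sketched in a few lines of Python. This is a hypothetical illustration, not Postgres source code; the `build_bitmap` helper and the toy TIDs are invented for the example. Row pointers (TIDs) are modeled as `(page, offset)` pairs:

```python
# Hypothetical sketch of an exact bitmap: row pointers ("TIDs", here
# (page, offset) pairs) found by an index scan are grouped by data page,
# so the heap can then be read in physical page order, one page at a time.

def build_bitmap(tids):
    """Group TIDs from an index scan into {page: set of offsets}."""
    bitmap = {}
    for page, offset in tids:
        bitmap.setdefault(page, set()).add(offset)
    return bitmap

# TIDs as the index returned them (index order, pages interleaved):
tids = [(7, 2), (1, 5), (7, 9), (1, 1), (3, 4)]

bitmap = build_bitmap(tids)

# Heap fetch: visit each page exactly once, in physical page order.
fetch_order = [(page, off)
               for page in sorted(bitmap)
               for off in sorted(bitmap[page])]
# fetch_order == [(1, 1), (1, 5), (3, 4), (7, 2), (7, 9)]
```

Note that `fetch_order` no longer reflects the order in which the index returned the TIDs, which is exactly why the index sort order is lost.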
To clarify the OP's questions: bitmaps do not sort row pointers according to the ORDER BY criteria. The original order in which the row pointers were found in the index is lost because the bitmap groups them by data page; that is why an extra sort step is needed afterwards.
* This statement assumes that reading data that is physically distant takes longer than reading data that is physically closer. This might not apply to all modern storage devices.
Upvotes: 1
Reputation: 656706
The backbone of Postgres storage is made up of data pages of 8 kilobytes in a typical installation. Each data page typically holds many tuples. Read the details of physical storage in the manual.
The "bitmap" in a bitmap scan is a way to collect tuple pointers in buckets of data pages. Index sort order is necessarily lost in this process in favor of physical sort order. In "lossy mode" (which only occurs if the result is so huge or work_mem so small that even the otherwise tiny bitmap won't fit), only block numbers are kept and the corresponding tuple indexes are discarded.
After that, each data page is visited only once to extract (possibly) multiple tuples, and pages are read in physical sequence, which also matters for some types of storage. In lossy mode, tuples from each identified page have to be filtered by rechecking the index condition; otherwise, tuples can be retrieved directly using the collected tuple indexes.
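The lossy-mode recheck can be sketched as follows. Again a hypothetical illustration, not Postgres source code; `lossy_fetch`, the toy heap layout, and the condition are invented for the example:

```python
# Hypothetical sketch of lossy mode: the bitmap keeps only page numbers,
# so every tuple on a marked page must be read and the original condition
# re-applied (this shows up as "Recheck Cond" in EXPLAIN ANALYZE output).

def lossy_fetch(lossy_pages, heap, condition):
    """heap: {page: [value, ...]}; return values passing the recheck."""
    result = []
    for page in sorted(lossy_pages):     # physical page order
        for row in heap[page]:           # read the whole page
            if condition(row):           # recheck the index condition
                result.append(row)
    return result

heap = {1: [10, 25, 40], 3: [5, 60]}
matches = lossy_fetch({1, 3}, heap, lambda x: x > 20)
# matches == [25, 40, 60]
```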
In an index scan, each page may have to be visited multiple times if multiple tuples end up being stored in the same data page. The actual process is even more sophisticated.
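The difference in heap-page visits can be made concrete with a small count. This is a hypothetical sketch (toy TIDs, not Postgres internals): a plain index scan may touch a page once per matching tuple, while a bitmap scan touches each page once.

```python
# Hypothetical sketch: counting heap-page visits for the same set of
# matching tuples, modeled as (page, offset) pairs in index order.

tids = [(1, 5), (7, 2), (1, 1), (7, 9), (1, 3)]  # index order

# Plain index scan: one heap visit per tuple pointer.
index_scan_visits = len(tids)

# Bitmap scan: one heap visit per distinct page.
bitmap_scan_visits = len({page for page, _ in tids})

# index_scan_visits == 5, bitmap_scan_visits == 2
```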
To your questions:
The sorting of the index is lost due to collecting hits and reading them out data page by data page.
Hence, the result has to be sorted again in an added sort step - if a sort order is required (by ORDER BY, for instance).
The number of data pages that have to be read is the most prominent factor for overall performance. Bitmap index scans reduce that number to a minimum. With faster storage, the benefit of bitmap index scans gets smaller. That's why accurate cost settings are crucial for the query planner to make good decisions.
Upvotes: 7