jhpratt
jhpratt

Reputation: 7120

Why does SQLite store hundreds of null bytes?

In a database I'm creating, I was curious why the size was so much larger than the contents, and checked out the hex code. In a 4 kB file (single row as a test), there are two major chunks that are roughly 900 and 1000 bytes, along with a couple smaller ones that are all null bytes 0x0

I can't think of any logical reason it would be advantageous to store thousands of null bytes, increasing the size of the database significantly.

Can someone explain this to me? I've tried searching, and haven't been able to find anything.

Upvotes: 0

Views: 647

Answers (1)

Dai
Dai

Reputation: 155523

The structure of a SQLite database file (`*.sqlite) is described in this page:

https://www.sqlite.org/fileformat.html

SQLite files are partitioned into "pages" which are between 512 and 65536 bytes long - in your case I imagine the page size is probably 1KiB. If you're storing data that's smaller than 1KiB (as you are with your single test row, which I imagine is maybe 100 bytes long?) then that leaves 900 bytes left - and unused (deallocated) space is usually zeroed-out before (and after) use.

It's the same way computer working memory (RAM) works - as RAM also uses paging.

I imagine you expected the file to be very compact with a terse internal representation; this is the case with some file formats - such as old-school OLE-based Office documents but others (and especially database files) require a different file layout that is optimized simultaneously for quick access, quick insertion of new data, and is also arranged to help prevent internal fragmentation - this comes at the cost of some wasted space.

A quick thought-experiment will demonstrate why mutable (i.e. non-read-only) databases cannot use a compact internal file structure:

  1. Think of a single database table as being like a CSV file (and CSVs themselves are compact enough with very little wasted space).
    1. You can INSERT new rows by appending to the end of the file.
    2. You can DELETE an existing row by simply overwriting the row's space in the file with zeroes. Note that you cannot actually "delete" the space by "moving" data (like using the Backspace key in Notepad) because that means copying all of the data in the file around - this is largely a bad idea.
    3. You can UPDATE a row by checking to see if the new row's width will fit in the current space (and overwrite the remaining space with zeros), or if not, then append a new row at the end and overwrite the existing row (a-la INSERT then DELETE)
  2. But what if you have two database tables (with different columns) and need to store them in the same file? One approach is to simply mix each table's rows in the same flat file - but for other reasons that's a bad idea. So instead, inside your entire *.sqlite file, you create "sub-files", that have a known, fixed size (e.g. 4KiB) that store only rows for a single table until the sub-file is full; they also store a pointer (like a linked-list) to the next sub-file that contains the rest of the data, if any. Then you simply create new sub-files as you need more space inside the file and set-up their next-file pointers. These sub-files are what a "page" is in a database file, and is how you can have multiple read/write database tables contained within the same parent filesystem file.

Then in addition to these pages to store table data, you also need to store the indexes (which is what allows you to locate a table row near-instantly without needing to scan the entire table or file) and other metadata, such as the column-definitions themselves - and often they're stored in pages too. Relational (tabular) database files can be considered filesystems in their own right (just encapsulated in a parent filesystem... which could be inside a *.vhd file... which could be buried inside a varbinary database column... inside another filesystem), and even the database systems themselves have been compared to operating-systems (as they offer an environment for programs (stored procedures) to run, they offer IO services, and so on - it's almost circular if you look at the old COBOL-based mainframes from the 1970s when all of your IO operations were restricted to just computer record management operations (insert, update, delete).

Upvotes: 4

Related Questions