msbir
msbir

Reputation: 67

Optimizing storage space for SQLite database

I have a SQLite database which I am using to store mostly file system activity. There are a huge number of repetitive paths and the disk usage is growing very fast. The data looks in the following way:

16  1570041298  0   4   C:\PROGRAM FILES (X86)\GOOGLE\CHROME\APPLICATION\77.0.3865.90\CHROME_CHILD.DLL      86634480
16  1570041298  0   1420    C:\PROGRAM FILES (X86)\GOOGLE\CHROME\APPLICATION\77.0.3865.90\CHROME_CHILD.DLL      86634480
16  1570041298  0   1420    C:\PROGRAM FILES (X86)\GOOGLE\CHROME\APPLICATION\77.0.3865.90\CHROME_CHILD.DLL      86634480
16  1570041298  0   4   C:\PROGRAM FILES (X86)\GOOGLE\CHROME\APPLICATION\77.0.3865.90\CHROME_CHILD.DLL      86634480
16  1570041298  0   2840    C:\PROGRAM FILES (X86)\GOOGLE\CHROME\APPLICATION\77.0.3865.90\CHROME_CHILD.DLL      86634480
16  1570041299  0   2840    C:\PROGRAM FILES (X86)\GOOGLE\CHROME\APPLICATION\77.0.3865.90\CHROME_CHILD.DLL      86634480
16  1570041299  0   3192    C:\PROGRAM FILES (X86)\GOOGLE\CHROME\APPLICATION\77.0.3865.90\CHROME_CHILD.DLL      86634480
16  1570041299  0   3192    C:\PROGRAM FILES (X86)\GOOGLE\CHROME\APPLICATION\77.0.3865.90\CHROME_CHILD.DLL      86634480

I have seen similar applications replacing the long strings with a long integer and basically using that to reference it. My question is, since this is not supported natively in SQLite, what are my design options to achieve the best optimization (what algorithm to use, will I still be able to use transactions to insert multiple records in the same time, etc.?)

Thank you!

Upvotes: 0

Views: 753

Answers (1)

Shawn
Shawn

Reputation: 52409

You're on the right track with wanting to use numeric ids instead of strings for the paths. You want a table that maps unique pathnames to id numbers, and use those ids as foreign keys in the original table.

Something like:

CREATE TABLE pathnames(id INTEGER PRIMARY KEY, path TEXT NOT NULL UNIQUE);
CREATE TABLE activity(id INTEGER PRIMARY KEY
                    , path_id INTEGER NOT NULL REFERENCES pathnames(id)
                    , timestamp INTEGER NOT NULL DEFAULT (strftime('%s', 'now'))
                    , etc);
CREATE INDEX activity_idx_path_id ON activity(path_id);

Then it's easy to add a path if it doesn't exist, and look it up by name when inserting a row into the activity table:

BEGIN;
INSERT OR IGNORE INTO pathnames(path) VALUES ('C:\whatever');
INSERT INTO activity(path_id, etc)
 VALUES ((SELECT id FROM pathnames WHERE path = 'C:\whatever'), 'created');
COMMIT;

and to look up everything for a specific file, something like

SELECT timestamp, etc
FROM activity
WHERE path_id = (SELECT id FROM pathnames WHERE path = 'C:\whatever')
ORDER BY timestamp;

except of course in your application you'd use parameters instead of the literal strings for all these statements.

Upvotes: 1

Related Questions