Reputation: 2300
Thanks a lot for your help! I have tried to break down my question:
Background
I am trying to understand how our file servers grow over time, and I would like to do this by collecting usage data and running statistics on it (space required by user, biggest directories, fastest growing directories, etc.).
To collect one snapshot I have written a script that walks through the directory tree and writes the data to a CSV file with the following header:
SCANID, SCANDATE, DIR, FILENAME, SIZE, MODTIME, ATIME, OWNER
SCANID: random uuid4 string identifying the scan
SCANDATE: the date when the scan was started (not necessarily a 1:1 relationship with SCANID)
Challenges
I can import this into an equivalent table in MySQL, but I quickly run into performance issues. There are millions of files on our system. The MySQL table is the following:
CREATE TABLE IF NOT EXISTS FILES (
    SCANID VARCHAR(40),
    SCANDATE DATE,
    DIRECTORY VARCHAR(700),
    FILENAME VARCHAR(300),
    SIZE BIGINT,
    ATIME DATETIME,
    MODTIME DATETIME,
    OWNER VARCHAR(50)
);
I believe a big chunk of the space is taken up by DIRECTORY and FILENAME, especially since they are repeated for every scan that I do.
The analyses that I would like to run on the data are the following:
* how fast does the total required disk space grow over time
* which user uses the most space - SELECT OWNER, SUM(SIZE) FROM FILES GROUP BY OWNER ORDER BY SUM(SIZE) DESC;
* which directories grow fastest over time (see the sketch after this list)
* which users generate the most data over time
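For the directory-growth question, I imagine something along these lines against the single flat FILES table above (just a sketch, not tuned in any way):

-- Space per directory per scan; comparing consecutive scans shows which
-- directories grow fastest.
SELECT SCANDATE, DIRECTORY, SUM(SIZE) AS TOTAL_BYTES
FROM FILES
GROUP BY SCANDATE, DIRECTORY
ORDER BY SCANDATE, TOTAL_BYTES DESC;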
Solution Approach
Create three tables, and only collect the data that changes (I only want to know when sizes change, and only keep the latest OWNER, ATIME and MODTIME).
Table 1: FILES
CREATE TABLE IF NOT EXISTS FILES (
    SCAN_ID INT,
    DIR VARCHAR(700) NOT NULL,
    FILENAME VARCHAR(500) NOT NULL,
    ATIME DATETIME DEFAULT NULL,
    MODTIME DATETIME DEFAULT NULL,
    OWNER VARCHAR(50) DEFAULT NULL,
    PRIMARY KEY (DIR, FILENAME)
);
Table 2: SIZES
SCAN_ID, FILE_ID, SIZE
Table 3: SCANDATES
SCAN_ID, SCANDATE
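Written out, these two tables could look roughly like this (FILE_ID is meant to be some surrogate key pointing at a row in FILES; I have not worked out that part yet):

CREATE TABLE IF NOT EXISTS SIZES (
    SCAN_ID INT NOT NULL,
    FILE_ID INT NOT NULL,  -- surrogate key referencing a row in FILES (still to be defined)
    SIZE BIGINT,
    PRIMARY KEY (SCAN_ID, FILE_ID)
);

CREATE TABLE IF NOT EXISTS SCANDATES (
    SCAN_ID INT NOT NULL PRIMARY KEY,
    SCANDATE DATE
);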
To get there, I load the full CSV file into a temporary table (FILES_IMPORT) with the same structure as the CSV file and then distribute the data into the respective tables.
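The import into FILES_IMPORT can be done with something like LOAD DATA (the path and CSV options below are just placeholders):

LOAD DATA LOCAL INFILE '/path/to/scan.csv'
INTO TABLE FILES_IMPORT
FIELDS TERMINATED BY ','
IGNORE 1 LINES
(SCANID, SCANDATE, DIR, FILENAME, SIZE, MODTIME, ATIME, OWNER);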
Table 1 should therefore have all the filenames that I have ever seen:
INSERT INTO FILES (DIR, FILENAME, OWNER, ATIME, MODTIME)
SELECT DIR, FILENAME, OWNER, ATIME, MODTIME FROM FILES_IMPORT
ON DUPLICATE KEY UPDATE
    OWNER = VALUES(OWNER),
    ATIME = VALUES(ATIME),
    MODTIME = VALUES(MODTIME);
As (DIR, FILENAME) is the primary key, this fulfills the requirement of having only one entry per file.
Question
However, I do not know how I would update the SIZES table. I do not want to use (DIR, FILENAME) as the PRIMARY KEY there, because that would not make the database smaller or faster. Apart from this, I do not know whether my approach is reasonable to begin with.
Thanks a bunch for your help.
Upvotes: 1
Views: 306
Reputation: 562731
I'll just write a few thoughts in no particular order.
You haven't described any specific SQL queries you will run against this data, so we can't devise a strategy for indexing or other optimizations. Those are important, and frequently more important than the size of the data.
If you're doing complex reporting queries that need to scan all rows anyway, you might need to use a column-oriented storage engine to get better performance. Again, this depends on the types of queries you need to optimize for.
I don't know why you have a SCAN_ID (4 bytes) when you could just use SCAN_DATE (3 bytes) since there's a 1:1 relationship (btw, please don't use the term "relation" when you mean "relationship"). If you don't use a SCAN_ID, then you can just use (SCAN_DATE, FILE_ID) as the primary key of your SIZES table.
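As a sketch of that idea (not a complete design):

CREATE TABLE SIZES (
    SCAN_DATE DATE NOT NULL,
    FILE_ID INT NOT NULL,
    SIZE BIGINT,
    PRIMARY KEY (SCAN_DATE, FILE_ID)  -- no separate SCAN_ID needed
);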
You could use the file's inode number as the file's primary key. Inode numbers are assigned independently on different filesystems, though, so if your server mounts multiple filesystems, you'd have to include some identifier for the filesystem in the primary key too.
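Sketched out, that could look something like this (the column names are made up for illustration):

CREATE TABLE FILES (
    FS_ID SMALLINT UNSIGNED NOT NULL,  -- identifies the mounted filesystem
    INODE BIGINT UNSIGNED NOT NULL,    -- inode number within that filesystem
    DIR VARCHAR(700),
    FILENAME VARCHAR(500),
    OWNER VARCHAR(50),
    PRIMARY KEY (FS_ID, INODE)
);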
Upvotes: 1