Basic database design: struggling to map dictionary thinking to relational thinking. How would this data be modeled?

Question

I've been programming for a decent amount of time, but the whole database thing has slipped completely by me. I'm now trying to pick up some of this, but I'm struggling with how to set up the relations.

The Purpose:

I work in a videography shop, so we have tons of files which need to be tagged with metadata. On the file system the individual files are numbered sequentially, and stored in a folder consisting of the project's ID number, and a csv file with all the metadata.

e.g.

Projects/ 
   |
   | - Project_ID_4539485/
              | - metatdata.csv
              | - Rec_01.mp4
              | - Rec_02.mp4
              | - ...
              | - Rec_N.mp4
   | - Project_ID_4539485/
            etc...

We keep metadata in a csv file like this:

| filename | Title | Album | Artist | etc.. |

I have a Python script which runs through and automatically tags all of the individual files based on that metadata csv. This project is an attempt at scaling up that script so that it doesn't need to be touched by humans anymore.

Modelling:

There are three main things I want to keep track of:

the metadata in each folder
A log of which files have been tagged and when
A list of all folders currently being watched by the script.

If I were to store this as JSON, it's relatively easy for me to think in that kind of nested dictionary way, but I'm breaking down when I try to translate it to relational stuff.

As a mess of dictionaries and lists, I'd set it up something like this:

watch_folders = {
    'Project_ID_4539485' : {
        'metadata' : {'Title' : 'bla', 'artist' : 'foo', etc..},
        'indexed' : [[filename, checksum, last_modified], 
                     [filename, checksum, last_modified],
                     etc..]
        },
    'Project_ID_4539682' : {
        'metadata' : {'Title' : 'bla', 'artist' : 'foo', etc..},
        'indexed' : [[filename, checksum, last_modified], 
                     [filename, checksum, last_modified],
                     etc..]
        },
    # and so on... 
}

Attempting to translate I end up with something like this:

create table metadata (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filename INTEGER, 
    title TEXT, 
    artist TEXT,
    album TEXT, 
    year TEXT,
    genre TEXT, 
    publisher TEXT, 
    url TEXT
);

create table indexed (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filename TEXT, 
    checksum TEXT,
    date_modified TEXT
);

create table project_folder (
    project_id INTEGER PRIMARY KEY, 
    metadata_key INTEGER, 
    indexed_key INTEGER
);

But I'm not sure how to "attach" all of the tables to each other.

For instance, starting with the metadata. I need one key which ties the metadata table to the project table. I'm not quite sure how to do that as the keys in metadata are all unique and related to external files -- not to mention those keys will be the same in each project folder as the individual files are all numbered sequentially.

So, it seems like the solution is just adding the project_id field to to the metadata table.

create table metadata (

    project_id PRIMARY KEY,
    filename INTEGER, 
    title TEXT, 
    artist TEXT,
    album TEXT, 
    year TEXT,
    genre TEXT, 
    publisher TEXT, 
    url TEXT
);

But this seems to violate what I understand about normalization as now every single row will have one piece of repeated information. I run into the same problem trying to attach the indexed table..

Is adding a project_id column in fact the way to go about it, or is there a better way?

Basic database design: struggling to map dictionary thinking to relational thinking. How would this data be modeled?

The Purpose:

Modelling:

Answers (1)

Related Questions