Reputation: 3584
I have a dataset that I need to work with that represents a part schematic for a large machine. I need to come up with an appropriate database schema for this dataset and am having trouble coming up with something to use that represents this data efficiently.
The top level components are the biggest "structures", and as you traverse down the hierarchy, the data represents inner components, or components that make up the inner components. For example, at the top level, there could be an engine as a level 1 component, and then a level 2 component is a piston, which goes into an engine, and a level 3 component could be a gasket that goes into the piston.
This representation is spread across a few hundred lines of a CSV file. There are 3 columns for IDs:
master_id
, which all components haveparent_id
, which all components have as well but their value varies based on the situation.
parent_id
is its own master_id. parent_id
is the master_id
of the level 1 component. parent_id
is the master_id
of the level 2 component. Basically, the parent id of any component is the master id of the component in the level above it. So lv1 parent is lv1 master (since it' s the root), lv2 parent is lv1 master, and lv3 is lv2 master. Also, multiple components can share a parent ID, meaning multiple lv2 parts, for example, can have the same parent ID.
grandparent_id
, which only level 3 components have (but not all lv3 components for some reason (idk I didn't make this data set)). If a component is lv3 and has a grandparent_id
, the grandparent ID is a direct link back to the master ID of the lv1 component. Yeah, confusing right? So here's an example. A lv3 component has a master_id
of 700000137, a parent_id
of 600000049, and a grandparent_id
of 500000006. If we look at the component with a master of 600000049, we'll see that this is a lv2 component that has a parent id of 500000006, which is the master id of a lv1 component, and again is the grandparent of this lv3 component.
I prefaced this post saying I need to come up with a database representation for this data set (it has later use in a project but the data organization is the first step). I'm comfortable using PostgreSQL, so my initial thoughts were to make 3 tables, master, parent, and grandparent, where based on the key that I'm parsing out, I would insert this into the appropriate database and foreign key back to the other tables if there were parent or grandparent keys. But I realized this could get quite hairy especially since there could be multiple foreign keys linking back to a single master id, and I feel with this representation some data could possibly get repeated, which I obviously don't want happening.
My second thought was to use something like a python dictionary, where I essentially build out a tree like structure where the lv1 components are in the top level, the lv2 components in the second, etc. I could then convert the dictionary into JSON, since Python is nice that way, and store that json blob in the database. But, this JSON blob could potentially get REALLY big, though I guess that's just something I'd have to live with as the dataset grows. This part schematic I was given is only for one machine, so basically each entry in my database would be like
id | name | json
----------------------
1 | machine_a | JSON_BLOB_MACHINE_A
----------------------
2 | machine_b | JSON_BLOB_MACHINE_B
etc...
does my second approach seem better than trying to create separate tables that represent each part level and foreign keying back to parents? If there's a better way to do this with Postgres, I'd appreciate you explaining it. Otherwise, I'm probably going to go with the latter route. Thanks!
Upvotes: 0
Views: 632
Reputation: 18909
If you don't need to join parts in other machines, then I think a jsonb
column for parts may be best. You can still index jsonb using GIN indexes and get really good performance from queries.
As long as the parts are not shared among many machines, which would make updating part properties across all machines tricky, then you probably OK.
This should make queries for a machine pretty effortless as majority of the data is self-contained.
Upvotes: 1