SQL Schema Design Advice

Question

I have a 'users' table which has a bunch of concrete "sure" properties about my users all of which must be there and their veracity is certain and then I have a separate table 'users_derived' where all data in this table is derived properties of my users guessed by machine learning models. For example: 'age' might be a certain property since they supplied it to me, 'height' or 'hair color' might be a derived property since an ML model guessed it from a picture. The main difference is all properties in the 'users' table were given to me by the user themselves and have complete certainty whereas all properties in the 'user_derived' table have both the value and a certainty associated with it and were guessed at by my system. The other difference is all properties of the 'users' table will be there for every user, while any property in the 'users_derived' table may or may not be there. From time to time I add new ML models which guess at more properties of users as well.

My question is how to do the schema for the 'users_derived' table. I could do it like this:

userid  |  prop1  | certainty1  |  prop2  | certainty2 | prop3 |  etc ...
123         7         0.57         5'8''       0.82       red
124         12        0.6          NULL        NULL       black
125         NULL      NULL         6'1''       0.88       blonde

or I could do it like this with slightly different indexing:

userid   |  property  |  value   |   certainty 
 123           1           7            0.57
 123           2          5'8''         0.82
 124           1           12           0.60
 123           3          red           0.67
 124           3          black         0.61
 125           2          6'1''         0.88
                       etc ....

So the tradeoffs seem like in the second way it isn't as normalized and might be slightly harder to query but you don't have to know all the properties you care about in advance -- that is if I want to add a new property there is no schema change. Also there don't have to be any NULL spots since if we don't have that property yet we just don't have a row for it. What am I missing? What are the benefits of the first way? Are there queries I can do against the first schema that are hard or impossible in the second schema? Does the second way somehow need more space for indexing to make it fast?

SQL Schema Design Advice

Answers (1)

Related Questions