Reputation: 3123
I have an ETL process which takes values from an input table which is a key value table with each row having a field ID and turning it into a more denormalized table where each row has all the values. Specifically, this is the input table:
StudentFieldValues (
FieldId INT NOT NULL,
StudentId INT NOT NULL,
Day DATE NOT NULL,
Value FLOAT NULL
)
FieldId
is a foreign key from table Field
, Day
is a foreign key from table Days
. The PK is the first 3 fields. There are currently 188 distinct fields. The output table is along the lines of:
StudentDays (
StudentId INT NOT NULL,
Day DATE NOT NULL,
NumberOfClasses FLOAT NULL,
MinutesLateToSchool FLOAT NULL,
... -- the rest of the 188 fields
)
The PK is the first 2 fields.
Currently the query that populates the output table does a self join with StudentFieldValues
188 times, one for each field. Each join equates StudentId
and Day
and takes a different FieldId
. Specifically:
SELECT Students.StudentId, Days.Day,
StudentFieldValues1.Value NumberOfClasses,
StudentFieldValues2.Value MinutesLateToSchool,
...
INTO StudentDays
FROM Students
CROSS JOIN Days
LEFT OUTER JOIN StudentFieldValues StudentFieldValues1
ON Students.StudentId=StudentFieldValues1.StudentId AND
Days.Day=StudentFieldValues1.Day AND
AND StudentFieldValues1.FieldId=1
LEFT OUTER JOIN StudentFieldValues StudentFieldValues2
ON Students.StudentId=StudentFieldValues2.StudentId AND
Days.Day=StudentFieldValues2.Day AND
StudentFieldValues2.FieldId=2
... -- 188 joins with StudentFieldValues table, one for each FieldId
I'm worried that this system isn't going to scale as more days, students and fields (especially fields) are added to the system. Already there are 188 joins and I keep reading that if you have a query with that number of joins you're doing something wrong. So I'm basically asking: Is this something that's gonna blow up in my face soon? Is there a better way to achieve what I'm trying to do? It's important to note that this query is minimally logged and that's something that wouldn't have been possible if I was adding the fields one after the other.
More details:
Upvotes: 1
Views: 907
Reputation: 1270713
Think about doing this using conditional aggregation:
SELECT s.StudentId, d.Day,
max(case when sfv.FieldId = 1 then sfv.Value end) as NumberOfClasses,
max(case when sfv.FieldId = 2 then sfv.Value end) as MinutesLateToSchool,
...
INTO StudentDays
FROM Students s CROSS JOIN
Days d LEFT OUTER JOIN
StudentFieldValues sfv
ON s.StudentId = sfv.StudentId AND
d.Day = sfv.Day
GROUP BY s.StudentId, d.Day;
This has the advantage of easy scalability. You can add hundreds of fields and the processing time should be comparable (longer, but comparable) to fewer fields. It is also easer to add new fields.
EDIT:
A faster version of this query would use subqueries instead of aggregation:
SELECT s.StudentId, d.Day,
(SELECT TOP 1 sfv.Value FROM StudentFieldValues WHERE sfv.FieldId = 1 and sfv.StudentId = s.StudentId and sfv.Day = sfv.Day) as NumberOfClasses,
(SELECT TOP 1 sfv.Value FROM StudentFieldValues WHERE sfv.FieldId = 2 and sfv.StudentId = s.StudentId and sfv.Day = sfv.Day) as MinutesLateToSchool,
...
INTO StudentDays
FROM Students s CROSS JOIN
Days d;
For performance, you want a composite index on StudentFieldValues(StudentId, day, FieldId, Value)
.
Upvotes: 2
Reputation: 15313
I think there may be some trial and error here to see what works but here are some things you can try:
The above was taken from an msdn post where someone was doing something similar to what you are.
Upvotes: 0
Reputation: 562721
Yes, this is going to blow up. You have your definitions of "normalized" and "denormalized" backwards. The Field/Value table design is not a relational design. It's a variation of the entity-attribute-value design, which has all sorts of problems.
I recommend you do not try to pivot the data in an SQL query. It doesn't scale well that way. Instea, you need to query it as a set of rows, as it is stored in the database, and fetch back the result set into your application. There you write code to read the data row by row, and apply the "fields" to fields of an object or a hashmap or something.
Upvotes: 2