ytoledano

Reputation: 3123

Creating a denormalized table from a normalized key-value table using 100s of joins

I have an ETL process that takes values from an input table (a key-value table where each row carries a field ID) and turns it into a more denormalized table where each row holds all the values. Specifically, this is the input table:

StudentFieldValues (
    FieldId INT NOT NULL,
    StudentId INT NOT NULL,
    Day DATE NOT NULL,
    Value FLOAT NULL
)

FieldId is a foreign key to the Field table, and Day is a foreign key to the Days table. The PK is the first 3 fields. There are currently 188 distinct fields. The output table is along the lines of:

StudentDays (
    StudentId INT NOT NULL,
    Day DATE NOT NULL,
    NumberOfClasses FLOAT NULL,
    MinutesLateToSchool FLOAT NULL,
    ... -- the rest of the 188 fields
)

The PK is the first 2 fields.

Currently the query that populates the output table does a self join with StudentFieldValues 188 times, one for each field. Each join equates StudentId and Day and takes a different FieldId. Specifically:

SELECT Students.StudentId, Days.Day, 
       StudentFieldValues1.Value NumberOfClasses, 
       StudentFieldValues2.Value MinutesLateToSchool,
       ...
INTO StudentDays
FROM Students
CROSS JOIN Days
LEFT OUTER JOIN StudentFieldValues StudentFieldValues1 
ON Students.StudentId=StudentFieldValues1.StudentId AND 
   Days.Day=StudentFieldValues1.Day AND
   StudentFieldValues1.FieldId=1
LEFT OUTER JOIN StudentFieldValues StudentFieldValues2 
ON Students.StudentId=StudentFieldValues2.StudentId AND 
   Days.Day=StudentFieldValues2.Day AND 
   StudentFieldValues2.FieldId=2
... -- 188 joins with StudentFieldValues table, one for each FieldId

I'm worried that this system isn't going to scale as more days, students and fields (especially fields) are added to the system. There are already 188 joins, and I keep reading that if you have a query with that many joins you're doing something wrong. So I'm basically asking: is this something that's gonna blow up in my face soon? Is there a better way to achieve what I'm trying to do? It's important to note that this query is minimally logged, and that wouldn't have been possible if I was adding the fields one after the other.


Upvotes: 1

Views: 907

Answers (3)

Gordon Linoff

Reputation: 1270713

Think about doing this using conditional aggregation:

SELECT s.StudentId, d.Day, 
       max(case when sfv.FieldId = 1 then sfv.Value end) as NumberOfClasses, 
       max(case when sfv.FieldId = 2 then sfv.Value end) as MinutesLateToSchool,
       ...
INTO StudentDays
FROM Students s CROSS JOIN
     Days d LEFT OUTER JOIN
     StudentFieldValues sfv 
     ON s.StudentId = sfv.StudentId AND 
        d.Day = sfv.Day 
GROUP BY s.StudentId, d.Day;

This has the advantage of easy scalability. You can add hundreds of fields and the processing time should remain comparable (longer, but comparable) to that with fewer fields. It is also easier to add new fields.
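To make the conditional-aggregation pattern concrete, here is a minimal runnable sketch using Python's sqlite3 with an in-memory database (table and field names follow the question; the CROSS JOIN against Students and Days is omitted for brevity):

```python
import sqlite3

# In-memory demo of the conditional-aggregation pivot.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE StudentFieldValues (
    FieldId INTEGER NOT NULL,
    StudentId INTEGER NOT NULL,
    Day TEXT NOT NULL,
    Value REAL,
    PRIMARY KEY (FieldId, StudentId, Day)
);
INSERT INTO StudentFieldValues VALUES
    (1, 100, '2015-09-01', 6),   -- NumberOfClasses
    (2, 100, '2015-09-01', 12);  -- MinutesLateToSchool
""")

# One pass over the key-value table; each MAX(CASE ...) picks out
# exactly one FieldId per output column, and MAX ignores the NULLs
# produced for all the other FieldIds.
rows = conn.execute("""
SELECT StudentId, Day,
       MAX(CASE WHEN FieldId = 1 THEN Value END) AS NumberOfClasses,
       MAX(CASE WHEN FieldId = 2 THEN Value END) AS MinutesLateToSchool
FROM StudentFieldValues
GROUP BY StudentId, Day
""").fetchall()

print(rows)  # [(100, '2015-09-01', 6.0, 12.0)]
```

The key point is that the key-value table is scanned once, however many fields there are, instead of once per field.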

EDIT:

A faster version of this query would use subqueries instead of aggregation:

SELECT s.StudentId, d.Day, 
       (SELECT TOP 1 sfv.Value FROM StudentFieldValues sfv WHERE sfv.FieldId = 1 AND sfv.StudentId = s.StudentId AND sfv.Day = d.Day) as NumberOfClasses, 
       (SELECT TOP 1 sfv.Value FROM StudentFieldValues sfv WHERE sfv.FieldId = 2 AND sfv.StudentId = s.StudentId AND sfv.Day = d.Day) as MinutesLateToSchool,
       ...
INTO StudentDays
FROM Students s CROSS JOIN
     Days d;

For performance, you want a composite index on StudentFieldValues(StudentId, Day, FieldId, Value).
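In T-SQL that might look like the following (the index name is illustrative; putting Value in INCLUDE rather than the key keeps the key narrow while still making the index covering):

```sql
-- Covers the correlated subquery's seek on (StudentId, Day, FieldId)
-- and includes Value so the lookup never touches the base table.
CREATE INDEX IX_StudentFieldValues_Pivot
    ON StudentFieldValues (StudentId, Day, FieldId)
    INCLUDE (Value);
```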

Upvotes: 2

Cole W

Reputation: 15313

I think there may be some trial and error here to see what works, but here are some things you can try:

  • Disable indexes and re-enable after data load is complete
  • Disable any triggers that don't need to be run during data loads.

The above was taken from an MSDN post where someone was doing something similar to what you are doing.

  • Think about updating the denormalized table based only on changed records, if this is possible. Limiting the result set would be much more efficient if this is an option.
  • You could try a more threaded, iterative approach in code (C#, VB, etc.) that builds this table student by student, so you aren't doing all of the joins at one time.
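For the first bullet, the T-SQL is along these lines (the index name here is a placeholder; this applies to nonclustered indexes on the target table):

```sql
-- Disable nonclustered indexes on the target table before the load...
ALTER INDEX IX_StudentDays_Example ON StudentDays DISABLE;

-- ...run the load, then rebuild to re-enable the index.
ALTER INDEX IX_StudentDays_Example ON StudentDays REBUILD;
```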

Upvotes: 0

Bill Karwin

Reputation: 562721

Yes, this is going to blow up. You have your definitions of "normalized" and "denormalized" backwards. The Field/Value table design is not a relational design. It's a variation of the Entity-Attribute-Value (EAV) design, which has all sorts of problems.

I recommend you do not try to pivot the data in an SQL query. It doesn't scale well that way. Instead, query it as a set of rows, as it is stored in the database, and fetch the result set back into your application. There you write code to read the data row by row, and apply the "fields" to fields of an object or a hashmap or something.
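A minimal sketch of that row-by-row approach in Python (the FieldId-to-name mapping is made up for illustration; in practice you would load it from the Field table):

```python
from collections import defaultdict

# Hypothetical mapping from FieldId to attribute name; in the real
# system this would come from the Field table.
FIELD_NAMES = {1: "NumberOfClasses", 2: "MinutesLateToSchool"}

def pivot_rows(rows):
    """Fold (FieldId, StudentId, Day, Value) rows into one dict per
    (StudentId, Day) key, mapping field name to value."""
    students = defaultdict(dict)
    for field_id, student_id, day, value in rows:
        students[(student_id, day)][FIELD_NAMES[field_id]] = value
    return dict(students)

# Rows as they would come back from the key-value table.
rows = [
    (1, 100, "2015-09-01", 6.0),
    (2, 100, "2015-09-01", 12.0),
]
result = pivot_rows(rows)
print(result[(100, "2015-09-01")])
# {'NumberOfClasses': 6.0, 'MinutesLateToSchool': 12.0}
```

Adding a new field is then a one-line change to the mapping, with no schema change and no wider query.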

Upvotes: 2
