Andrew Craswell
Andrew Craswell

Reputation: 674

Group By Column, Select Most Recent Value

I'm performing a query on a table which tracks the results of a test taken by students. The test is composed of multiple sections, and there is a column for each section score. Each row is an instance of the test taken by a student. The sections can either be taken all at once, or split into multiple attempts. For example, a student can take one section today, and the rest tomorrow. In addition, a student is allowed to retake any section of the test.

Sample Student:

StudentID   WritingSection   ReadingSection   MathSection   DateTaken
1           65               85               54            4/1/2013 14:53
1           98               NULL             NULL          4/8/2013 13:13
1           NULL             NULL             38            5/3/2013 12:43

A NULL means that the section was not administered for the given test instance, and a second section score means the section was retaken.

I want a query that groups by the StudentID such that there is only one row per student, and the most recent score for each section is returned. I'm looking for an efficient way to solve this problem as we have many hundreds of thousands of test attempts in the database.

Expected Result:

StudentID    WritingSection    ReadingSection    MathSection    DateTaken
1            98                85                38             5/3/2013 12:43

EDIT: There have been a lot of good solutions. I want to experiment with each next week a little more before choosing the answer. Thanks everyone!

Upvotes: 3

Views: 2897

Answers (6)

Derrick Mink
Derrick Mink

Reputation: 1

SELECT student.studentid, 
       WRITE.writingsection, 
       READ.readingsection, 
       math.mathsection, 
       student.datetaken 
FROM 
-- list of students / max dates taken 
(SELECT studentid, 
        Max(datetaken) datetaken 
 FROM   test_record 
 GROUP  BY studentid) student, 
-- greatest date for student with a writingsection score (dont care what the date is here, just that the score comes from the greatest date) 
(SELECT studentid, 
        writingsection 
 FROM   test_record t 
 WHERE  writingsection IS NOT NULL 
        AND datetaken = (SELECT Max(datetaken) 
                         FROM   test_record 
                         WHERE  studentid = t.studentid 
                                AND writingsection IS NOT NULL)) WRITE, 
(SELECT studentid, 
        readingsection 
 FROM   test_record t 
 WHERE  readingsection IS NOT NULL 
        AND datetaken = (SELECT Max(datetaken) 
                         FROM   test_record 
                         WHERE  studentid = t.studentid 
                                AND readingsection IS NOT NULL)) READ, 
(SELECT studentid, 
        mathsection 
 FROM   test_record t 
 WHERE  mathsection IS NOT NULL 
        AND datetaken = (SELECT Max(datetaken) 
                         FROM   test_record 
                         WHERE  studentid = t.studentid 
                                AND mathsection IS NOT NULL)) math 
WHERE 
  -- outer join in case a student has no score recorded for one or more of the sections  
  student.studentid = READ.studentid(+) 
  AND student.studentid = WRITE.studentid(+) 
  AND student.studentid = math.studentid(+); 

Upvotes: 0

Mike Monteiro
Mike Monteiro

Reputation: 1457

Sorry - my previous answer answered a DIFFERENT question than the one posed :) It will return all data from the MOST RECENT row. The question asked is to aggregate over all rows to grab the most recent score for each subject individually.

But I'm leaving it up there because the question I answered is a common one, and maybe someone landing on this question actually had that question instead :)

Now to answer the actual question:

I think the cleanest way to do this is with PIVOT and UNPIVOT:

SELECT StudentID, [WritingSection], [ReadingSection], [MathSection], MAX(DateTaken) DateTaken
FROM (
  SELECT StudentID, Subject, DateTaken, Score
  FROM (
    SELECT StudentID, Subject, DateTaken, Score
      , row_number() OVER (PARTITION BY StudentID, Subject ORDER BY DateTaken DESC) as rowNum
    FROM Students s
    UNPIVOT (
      Score FOR Subject IN ([WritingSection],[ReadingSection],[MathSection])
    ) u
  ) x
  WHERE x.rowNum = 1
) y
PIVOT (
  MAX(Score) FOR Subject IN ([WritingSection],[ReadingSection],[MathSection])
) p
GROUP BY StudentID, [WritingSection], [ReadingSection], [MathSection]

The innermost subquery (x) uses SQL's UNPIVOT function to normalize the data (meaning to turn each student's score on each section of the test into a single row).

The next subquery out (y) is simply to filter the rows to only the most recent score FOR EACH SUBJECT INDIVIDUALLY (this is a workaround of the SQL bug that you can't use windowed functions like row_number() in a WHERE clause).

Lastly, since you want the data displayed back in the denormalized original format (1 column for each section of the test), we use SQL's PIVOT function. This simply turns rows into columns - one for each section of the test. Finally, you said you wanted the most recent test taken shown (despite the fact that each section could have its own unique "most recent" date). So we simply aggregate over those 3 potentially different DateTakens to find the most recent.

This will scale more easily than other solutions if there are more Sections added in the future - just add the column names to the list.

Upvotes: 2

Gordon Linoff
Gordon Linoff

Reputation: 1271111

This is tricky. Each section score is coming potentially from a different record. But the normal rules of max() and min() don't apply.

The following query gets a sequence number for each section, starting with the latest non-NULL value. This is then used for conditional aggregation in the outer query:

select s.StudentId,
       max(case when ws_seqnum = 1 then WritingSection end) as WritingSection,
       max(case when rs_seqnum = 1 then ReadingSection end) as ReadingSection,
       max(case when ms_seqnum = 1 then MathSection end) as MathSection,
       max(DateTaken) as DateTaken
from (select s.*,
             row_number() over (partition by studentid
                                order by (case when WritingSection is not null then 0 else 1 end), DateTaken desc
                               ) as ws_seqnum,
             row_number() over (partition by studentid
                                order by (case when ReadingSection is not null then 0 else 1 end), DateTaken desc
                               ) as rs_seqnum,
             row_number() over (partition by studentid
                                order by (case when MathSection is not null then 0 else 1 end), DateTaken desc
                               ) as ms_seqnum
      from student s
     ) s
where StudentId = 1
group by StudentId;

The where clause is optional in this query. You can remove it and it should still work on all students.

This query is more complicated than it needs to be, because the data is not normalized. If you have control over the data structure, consider an association/junction table, with one row per student per test with the score and test date as columns in the table. (Full normality would introduce another table for the test dates, but that probably isn't necessary.)

Upvotes: 2

Joe
Joe

Reputation: 1219

How about using the following to the maximum DateTaken?

SELECT max(DateTaken) FROM TABLE_NAME WHERE StudentID = 1

You could use that in a sub query to get a row like?

SELECT WritingSection FROM TABLE_NAME WHERE StudentID = 1 and DateTaken = (SELECT max(DateTaken) FROM TABLE_NAME WHERE StudentID = 1 and WritingSection IS NOT NULL)

You would need to run this twice more for ReadingSection and MathSection?

Upvotes: 0

Mike Monteiro
Mike Monteiro

Reputation: 1457

This is a pretty classic annoying problem in SQL - there's no super elegant way to do it. Here's the best I've found:

SELECT s.*
FROM Students s
JOIN (
  SELECT StudentID, MAX(DateTaken) as MaxDateTaken
  FROM Students
  GROUP BY StudentID
) f ON s.StudentID = f.StudentID AND s.DateTaken = f.MaxDateTaken

Joining on the date field isn't super ideal (this breaks in the event of ties for a MAX) or fast (depending on how the table is indexed). If you have an int rowID that is unique across all rows, it would be preferable to do:

SELECT s.*
FROM Students s
JOIN (
  SELECT rowID
  FROM (
    SELECT StudentID, rowID, row_number() OVER (PARTITION BY StudentID ORDER BY DateTaken DESC) as rowNumber
    FROM Students
  ) x
  WHERE x.rowNumber = 1
) f ON s.rowID = f.rowID

Upvotes: 0

Koshera
Koshera

Reputation: 449

Joe's solution will return only one student id - the one that took the test the latest. The way to get the latest date for each student id is to use analytical functions. Here's an example if you're using Oracle database:

SELECT a.StudentID, a.DateTaken
  FROM (  SELECT StudentID,
             DateTaken,
             ROW_NUMBER ()
                OVER (PARTITION BY StudentID ORDER BY DateTaken DESC)
                rn
        FROM pto.test
    ORDER BY DateTaken DESC) a
 WHERE a.rn = 1

Note how the row_number() funciton will put 1 at the latest date of each student id. And on the outer select you just filter those records with rn = 1... Execute only the inner select to see how it works. Let me know what kind of database you're using to give you a solution for it. Each database has it's own implementation of analytical functions but the logic is the same...

Upvotes: 0

Related Questions