Reputation: 674
I'm performing a query on a table which tracks the results of a test taken by students. The test is composed of multiple sections, and there is a column for each section score. Each row is an instance of the test taken by a student. The sections can either be taken all at once, or split into multiple attempts. For example, a student can take one section today, and the rest tomorrow. In addition, a student is allowed to retake any section of the test.
Sample Student:
StudentID WritingSection ReadingSection MathSection DateTaken 1 65 85 54 4/1/2013 14:53 1 98 NULL NULL 4/8/2013 13:13 1 NULL NULL 38 5/3/2013 12:43
A NULL
means that the section was not administered for the given test instance, and a second section score means the section was retaken.
I want a query that groups by the StudentID
such that there is only one row per student, and the most recent score for each section is returned. I'm looking for an efficient way to solve this problem as we have many hundreds of thousands of test attempts in the database.
Expected Result:
StudentID WritingSection ReadingSection MathSection DateTaken 1 98 85 38 5/3/2013 12:43
EDIT: There have been a lot of good solutions. I want to experiment with each next week a little more before choosing the answer. Thanks everyone!
Upvotes: 3
Views: 2897
Reputation: 1
SELECT student.studentid,
WRITE.writingsection,
READ.readingsection,
math.mathsection,
student.datetaken
FROM
-- list of students / max dates taken
(SELECT studentid,
Max(datetaken) datetaken
FROM test_record
GROUP BY studentid) student,
-- greatest date for student with a writingsection score (dont care what the date is here, just that the score comes from the greatest date)
(SELECT studentid,
writingsection
FROM test_record t
WHERE writingsection IS NOT NULL
AND datetaken = (SELECT Max(datetaken)
FROM test_record
WHERE studentid = t.studentid
AND writingsection IS NOT NULL)) WRITE,
(SELECT studentid,
readingsection
FROM test_record t
WHERE readingsection IS NOT NULL
AND datetaken = (SELECT Max(datetaken)
FROM test_record
WHERE studentid = t.studentid
AND readingsection IS NOT NULL)) READ,
(SELECT studentid,
mathsection
FROM test_record t
WHERE mathsection IS NOT NULL
AND datetaken = (SELECT Max(datetaken)
FROM test_record
WHERE studentid = t.studentid
AND mathsection IS NOT NULL)) math
WHERE
-- outer join in case a student has no score recorded for one or more of the sections
student.studentid = READ.studentid(+)
AND student.studentid = WRITE.studentid(+)
AND student.studentid = math.studentid(+);
Upvotes: 0
Reputation: 1457
Sorry - my previous answer answered a DIFFERENT question than the one posed :) It will return all data from the MOST RECENT row. The question asked is to aggregate over all rows to grab the most recent score for each subject individually.
But I'm leaving it up there because the question I answered is a common one, and maybe someone landing on this question actually had that question instead :)
Now to answer the actual question:
I think the cleanest way to do this is with PIVOT and UNPIVOT:
SELECT StudentID, [WritingSection], [ReadingSection], [MathSection], MAX(DateTaken) DateTaken
FROM (
SELECT StudentID, Subject, DateTaken, Score
FROM (
SELECT StudentID, Subject, DateTaken, Score
, row_number() OVER (PARTITION BY StudentID, Subject ORDER BY DateTaken DESC) as rowNum
FROM Students s
UNPIVOT (
Score FOR Subject IN ([WritingSection],[ReadingSection],[MathSection])
) u
) x
WHERE x.rowNum = 1
) y
PIVOT (
MAX(Score) FOR Subject IN ([WritingSection],[ReadingSection],[MathSection])
) p
GROUP BY StudentID, [WritingSection], [ReadingSection], [MathSection]
The innermost subquery (x) uses SQL's UNPIVOT function to normalize the data (meaning to turn each student's score on each section of the test into a single row).
The next subquery out (y) is simply to filter the rows to only the most recent score FOR EACH SUBJECT INDIVIDUALLY (this is a workaround of the SQL bug that you can't use windowed functions like row_number() in a WHERE clause).
Lastly, since you want the data displayed back in the denormalized original format (1 column for each section of the test), we use SQL's PIVOT function. This simply turns rows into columns - one for each section of the test. Finally, you said you wanted the most recent test taken shown (despite the fact that each section could have its own unique "most recent" date). So we simply aggregate over those 3 potentially different DateTakens to find the most recent.
This will scale more easily than other solutions if there are more Sections added in the future - just add the column names to the list.
Upvotes: 2
Reputation: 1271111
This is tricky. Each section score is coming potentially from a different record. But the normal rules of max()
and min()
don't apply.
The following query gets a sequence number for each section, starting with the latest non-NULL value. This is then used for conditional aggregation in the outer query:
select s.StudentId,
max(case when ws_seqnum = 1 then WritingSection end) as WritingSection,
max(case when rs_seqnum = 1 then ReadingSection end) as ReadingSection,
max(case when ms_seqnum = 1 then MathSection end) as MathSection,
max(DateTaken) as DateTaken
from (select s.*,
row_number() over (partition by studentid
order by (case when WritingSection is not null then 0 else 1 end), DateTaken desc
) as ws_seqnum,
row_number() over (partition by studentid
order by (case when ReadingSection is not null then 0 else 1 end), DateTaken desc
) as rs_seqnum,
row_number() over (partition by studentid
order by (case when MathSection is not null then 0 else 1 end), DateTaken desc
) as ms_seqnum
from student s
) s
where StudentId = 1
group by StudentId;
The where
clause is optional in this query. You can remove it and it should still work on all students.
This query is more complicated than it needs to be, because the data is not normalized. If you have control over the data structure, consider an association/junction table, with one row per student per test with the score and test date as columns in the table. (Full normality would introduce another table for the test dates, but that probably isn't necessary.)
Upvotes: 2
Reputation: 1219
How about using the following to the maximum DateTaken?
SELECT max(DateTaken) FROM TABLE_NAME WHERE StudentID = 1
You could use that in a sub query to get a row like?
SELECT WritingSection FROM TABLE_NAME WHERE StudentID = 1 and DateTaken = (SELECT max(DateTaken) FROM TABLE_NAME WHERE StudentID = 1 and WritingSection IS NOT NULL)
You would need to run this twice more for ReadingSection and MathSection?
Upvotes: 0
Reputation: 1457
This is a pretty classic annoying problem in SQL - there's no super elegant way to do it. Here's the best I've found:
SELECT s.*
FROM Students s
JOIN (
SELECT StudentID, MAX(DateTaken) as MaxDateTaken
FROM Students
GROUP BY StudentID
) f ON s.StudentID = f.StudentID AND s.DateTaken = f.MaxDateTaken
Joining on the date field isn't super ideal (this breaks in the event of ties for a MAX) or fast (depending on how the table is indexed). If you have an int rowID that is unique across all rows, it would be preferable to do:
SELECT s.*
FROM Students s
JOIN (
SELECT rowID
FROM (
SELECT StudentID, rowID, row_number() OVER (PARTITION BY StudentID ORDER BY DateTaken DESC) as rowNumber
FROM Students
) x
WHERE x.rowNumber = 1
) f ON s.rowID = f.rowID
Upvotes: 0
Reputation: 449
Joe's solution will return only one student id - the one that took the test the latest. The way to get the latest date for each student id is to use analytical functions. Here's an example if you're using Oracle database:
SELECT a.StudentID, a.DateTaken
FROM ( SELECT StudentID,
DateTaken,
ROW_NUMBER ()
OVER (PARTITION BY StudentID ORDER BY DateTaken DESC)
rn
FROM pto.test
ORDER BY DateTaken DESC) a
WHERE a.rn = 1
Note how the row_number() funciton will put 1 at the latest date of each student id. And on the outer select you just filter those records with rn = 1... Execute only the inner select to see how it works. Let me know what kind of database you're using to give you a solution for it. Each database has it's own implementation of analytical functions but the logic is the same...
Upvotes: 0