Reputation: 87
I have to find a pair of students who take exactly the same classes from table that has studentID
and courseID
.
studentID | courseID
1 1
1 2
1 3
2 1
3 1
3 2
3 3
Query should return (1, 3)
.
The result also should not have duplicate rows such as (1,3)
and (3,1)
.
Upvotes: 2
Views: 5547
Reputation: 11
Process to get this done in mysql
Create table student_course_agg
(
student_id int,
courses varchar(150)
);
INSERT INTO student_course_agg
select studentID ,GROUP_CONCAT(courseID ORDER BY courseID) courses
FROM STUDENTS
GROUP BY 1;
SELECT master.student_id m_student_id,child.student_id c_student_id
FROM student_course_agg master
JOIN student_course_ag child
ON master.student_id<child.student_id and master.courses=child.courses;
Direct query.
SELECT master.student_id m_student_id,child.student_id c_student_id
FROM (select studentID ,GROUP_CONCAT(courseID ORDER BY courseID) courses
FROM STUDENTS
GROUP BY 1) master
JOIN (select studentID ,GROUP_CONCAT(courseID ORDER BY courseID) courses
FROM STUDENTS
GROUP BY 1) child
ON master.studentID <child.studentID and master.courses=child.courses;
Upvotes: 0
Reputation: 44250
Naive relational division implementation, with CTE:
WITH pairs AS (
SELECT DISTINCT a.student_id AS aaa
, b.student_id AS bbb
FROM student_course a
JOIN student_course b ON a.course_id = b.course_id
)
SELECT *
FROM pairs p
WHERE p.aaa < p.bbb
AND NOT EXISTS (
SELECT * FROM student_course nx1
WHERE nx1.student_id = p.aaa
AND NOT EXISTS (
SELECT * FROM student_course nx2
WHERE nx2.student_id = p.bbb
AND nx2.course_id = nx1.course_id
)
)
AND NOT EXISTS (
SELECT * FROM student_course nx1
WHERE nx1.student_id = p.bbb
AND NOT EXISTS (
SELECT * FROM student_course nx2
WHERE nx2.student_id = p.aaa
AND nx2.course_id = nx1.course_id
)
)
;
The same, without CTE's:
SELECT *
FROM (
SELECT DISTINCT a.student_id AS aaa
, b.student_id AS bbb
FROM student_course a
JOIN student_course b ON a.course_id = b.course_id
) p
WHERE p.aaa < p.bbb
AND NOT EXISTS (
SELECT * FROM student_course nx1
WHERE nx1.student_id = p.aaa
AND NOT EXISTS (
SELECT * FROM student_course nx2
WHERE nx2.student_id = p.bbb
AND nx2.course_id = nx1.course_id
)
)
AND NOT EXISTS (
SELECT * FROM student_course nx1
WHERE nx1.student_id = p.bbb
AND NOT EXISTS (
SELECT * FROM student_course nx2
WHERE nx2.student_id = p.aaa
AND nx2.course_id = nx1.course_id
)
)
;
The non-CTE version is faster, obviously.
Upvotes: 1
Reputation: 656754
I created a somewhat realistic test case:
CREATE TEMP TABLE student_course (
student_id integer
,course_id integer
,PRIMARY KEY (student_id, course_id)
);
INSERT INTO student_course
SELECT *
FROM (VALUES (1, 1), (1, 2), (1, 3), (2, 1), (3, 1), (3, 2), (3, 3)) v
-- to include some non-random values in test
UNION ALL
SELECT DISTINCT student_id, normal_rand((random() * 30)::int, 1000, 35)::int
FROM generate_series(4, 5000) AS student_id;
DELETE FROM student_course WHERE random() > 0.9; -- create some dead tuples
ANALYZE student_course; -- needed for temp table
Note the use of normal_rand()
to populate the dummy table with a normal distribution of values. It's shipped with the tablefunc module, and since i am going to use that further down anyway ...
Also note the bold emphasis on the numbers I am going to manipulate for the benchmark to simulate various test cases.
The question is rather basic and unclear. Find the first two students with matching courses? Or find all? Find couples of them or groups of students sharing the same courses?
Craig answers to:
Find all couples sharing the same courses.
Plain SQL With a CTE and grouping by arrays, slightly formatted:
WITH student_coursearray(student_id, courses) AS (
SELECT student_id, array_agg(course_id ORDER BY course_id)
FROM student_course
GROUP BY student_id
)
SELECT a.student_id, b.student_id
FROM student_coursearray a
JOIN student_coursearray b ON (a.courses = b.courses)
WHERE a.student_id < b.student_id
ORDER BY a.student_id, b.student_id;
The second query in Craig's answer dropped out of the race right away. With more than just a few rows, performance quickly deteriorates badly. The CROSS JOIN
is poison.
There is one major weakness, ORDER BY
per aggregate is a bad performer, so I rewrote with ORDER BY
in a subquery:
WITH cte AS (
SELECT student_id, array_agg(course_id) AS courses
FROM (SELECT student_id, course_id FROM student_course ORDER BY 1, 2) sub
GROUP BY student_id
)
SELECT a.student_id, b.student_id
FROM cte a
JOIN cte b USING (courses)
WHERE a.student_id < b.student_id
ORDER BY 1,2;
I think the generally more useful case is:
Find all students sharing the same courses.
So I return arrays of students with matching courses.
WITH s AS (
SELECT student_id, array_agg(course_id) AS courses
FROM (SELECT student_id, course_id FROM student_course ORDER BY 1, 2) sub
GROUP BY student_id
)
SELECT array_agg(student_id)
FROM s
GROUP BY courses
HAVING count(*) > 1
ORDER BY array_agg(student_id);
To make this generic and fast I wrapped it into a plpgsql function with dynamic SQL:
CREATE OR REPLACE FUNCTION f_same_set(_tbl regclass, _id text, _match_id text)
RETURNS SETOF int[] AS
$func$
BEGIN
RETURN QUERY EXECUTE format(
$f$
WITH s AS (
SELECT %1$I AS id, array_agg(%2$I) AS courses
FROM (SELECT %1$I, %2$I FROM %3$s ORDER BY 1, 2) s
GROUP BY 1
)
SELECT array_agg(id)
FROM s
GROUP BY courses
HAVING count(*) > 1
ORDER BY array_agg(id)
$f$
,_id, _match_id, _tbl
);
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM f_same_set('student_course', 'student_id', 'course_id');
Works for any table with numeric columns. It's trivial to extend for other data types, too.
crosstab()
For a relatively small number of courses
(and arbitrarily big number of students) crosstab()
provided by the additional tablefunc module is another option in PostgreSQL. More general info here:
PostgreSQL Crosstab Query
A simple case for the simple example in the question, much like explained in the linked answer:
SELECT array_agg(student_id)
FROM crosstab('
SELECT student_id, course_id, TRUE
FROM student_course
ORDER BY 1'
,'VALUES (1),(2),(3)'
)
AS t(student_id int, c1 bool, c2 bool, c3 bool)
GROUP BY c1, c2, c3
HAVING count(*) > 1;
For the simple case, the crosstab variant was faster, so I build a plpgsql function with dynamic SQL and included it in the test. Functionally identical with F1.
CREATE OR REPLACE FUNCTION f_same_set_x(_tbl regclass, _id text, _match_id text)
RETURNS SETOF int[] AS
$func$
DECLARE
_ids int[]; -- for array of match_ids (course_id in example)
BEGIN
-- Get list of match_ids
EXECUTE format(
'SELECT array_agg(DISTINCT %1$I ORDER BY %1$I) FROM %2$s',_match_id, _tbl)
INTO _ids;
-- Main query
RETURN QUERY EXECUTE format(
$f$
SELECT array_agg(%1$I)
FROM crosstab('SELECT %1$I, %2$I, TRUE FROM %3$s ORDER BY 1'
,'VALUES (%4$s)')
AS t(student_id int, c%5$s bool)
GROUP BY c%6$s
HAVING count(*) > 1
ORDER BY array_agg(student_id)
$f$
,_id
,_match_id
,_tbl
,array_to_string(_ids, '),(') -- values
,array_to_string(_ids, ' bool,c') -- column def list
,array_to_string(_ids, ',c') -- names
);
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM f_same_set_x('student_course', 'student_id', 'course_id');
I tested on my small PostgreSQL test server.
PostgreSQL 9.1.9 on Debian Linux on an ~ 6 years old AMD Opteron Server. I ran 5 test sets with the above settings and each of the presented queries. Best of 5 with EXPLAIN ANALYZE
.
I used these values for the bold numbers in the above test case to populate:
nr. of students / max. nr. of courses / standard deviation (results in more distinct course_ids)
1. 1000 / 30 / 35
2. 5000 / 30 / 50
3. 10000 / 30 / 100
4. 10000 / 10 / 10
5. 10000 / 5 / 5
C1
1. Total runtime: 57 ms
2. Total runtime: 315 ms
3. Total runtime: 663 ms
4. Total runtime: 543 ms
5. Total runtime: 2345 ms (!) - deteriorates with many pairs
E1
1. Total runtime: 46 ms
2. Total runtime: 251 ms
3. Total runtime: 529 ms
4. Total runtime: 338 ms
5. Total runtime: 734 ms
E2
1. Total runtime: 45 ms
2. Total runtime: 245 ms
3. Total runtime: 515 ms
4. Total runtime: 218 ms
5. Total runtime: 143 ms
F1 victor
1. Total runtime: 14 ms
2. Total runtime: 77 ms
3. Total runtime: 166 ms
4. Total runtime: 80 ms
5. Total runtime: 54 ms
F2
1. Total runtime: 62 ms
2. Total runtime: 336 ms
3. Total runtime: 1053 ms (!) crosstab() deteriorates with many distinct values
4. Total runtime: 195 ms
5. Total runtime: 105 ms (!) but performs well with fewer distinct values
The PL/pgSQL function with dynamic SQL, sorting rows in a subquery is clear victor.
Upvotes: 1
Reputation: 324465
Given sample data:
CREATE TABLE student_course (
student_id integer,
course_id integer,
PRIMARY KEY (student_id, course_id)
);
INSERT INTO student_course (student_id, course_id)
VALUES (1, 1), (1, 2), (1, 3), (2, 1), (3, 1), (3, 2), (3, 3) ;
One option is to use a CTE to join on the ordered lists of courses each student is taking:
WITH student_coursearray(student_id, courses) AS (
SELECT student_id, array_agg(course_id ORDER BY course_id)
FROM student_course
GROUP BY student_id
)
SELECT a.student_id, b.student_id
FROM student_coursearray a INNER JOIN student_coursearray b ON (a.courses = b.courses)
WHERE a.student_id > b.student_id;
array_agg
is actually part of the SQL standard, as is the WITH
common-table expression syntax. Neither are supported by MySQL so you'll have to express this a different way if you want to support MySQL.
Another way to think about this would be "for every student pairing, find out if one is taking a class the other is not". This would lend its self to a FULL OUTER JOIN
, but it's pretty awkward to express. You have to determine the pairings of student IDs of interest, then for each pairing do a full outer join across the set of classes each takes. If there are any null rows then one took a class the other didn't, so you can use that with a NOT EXISTS
filter to exclude such pairings. That gives you this monster:
WITH student_id_pairs(left_student, right_student) AS (
SELECT DISTINCT a.student_id, b.student_id
FROM student_course a
INNER JOIN student_course b ON (a.student_id > b.student_id)
)
SELECT left_student, right_student
FROM student_id_pairs
WHERE NOT EXISTS (
SELECT 1
FROM (SELECT course_id FROM student_course WHERE student_id = left_student) a
FULL OUTER JOIN (SELECT course_id FROM student_course b WHERE student_id = right_student) b
ON (a.course_id = b.course_id)
WHERE a.course_id IS NULL or b.course_id IS NULL
);
The CTE is optional and may be replaced by a CREATE TEMPORARY TABLE AS SELECT ...
or whatever if your DB doesn't support CTEs.
I'm very confident that the array approach will perform better in all cases, particularly because for a really large data set you can take the WITH
expression, create a temporary table from the query instead, add an index on (courses, student_id)
to it and do crazy-fast equality searching that'll well and truly pay off the cost of the index creation time. You can't do that with the subquery joins approach.
Upvotes: 9
Reputation: 7123
select courses,group_concat(studentID) from
(select studentID,
group_concat(courseID order by courseID) as courses
from Table1 group by studentID) abc
group by courses having courses like('%,%');
Upvotes: 2