Andrew Brēza
Andrew Brēza

Reputation: 8317

Calculating overlap in MySQL

I'm trying to find out which classes have the most overlap between them. The data are stored in MySQL and each student has a completely separate row in the database for each class that he/she takes (I didn't configure it and I can't change it). I pasted a simplified version of the table below. In reality there are around 20 different courses.

CREATE TABLE classes
(`student_id` int, `class` varchar(13));
INSERT INTO classes
(`student_id`, `class`)
VALUES
(55421, 'algebra'),
(27494, 'algebra'),
(64934, 'algebra'),
(65364, 'algebra'),
(21102, 'algebra'),
(90734, 'algebra'),
(20103, 'algebra'),
(57450, 'gym'),
(76411, 'gym'),
(24918, 'gym'),
(65364, 'gym'),
(55421, 'gym'),
(89607, 'world_history'),
(54522, 'world_history'),
(49581, 'world_history'),
(84155, 'world_history'),
(55421, 'world_history'),
(57450, 'world_history');

I eventually want to use Circos (background here) but I'd be happy with any method that allowed me to understand and show people where there is the most and least overlap. This is off the top of my head, but I was thinking that I could use an output table with one row and one column for each course and the number of overlaps listed where different classes intersect. Where each course intersected with itself could show the number of people who have no overlap with any other category.

Screenshot of a 3x3 matrix from Excel

Upvotes: 0

Views: 341

Answers (2)

user6307642
user6307642

Reputation:

You can do that by generating results to represent links : src -> dst = nb

1) Get matrix

select c1.class src_class, c2.class dst_class
from (select distinct class from classes) c1
join (select distinct class from classes) c2
order by src_class, dst_class

The "select distinct class" is not necessary to generate matrix, you can just directly select classes and GROUP BY. But, at step 2 we need that unique results.

Result :

src_class      dst_class
-----------------------------
algebra        algebra
algebra        gym
algebra        world_history
gym            algebra
gym            gym
gym            world_history
world_history  algebra
world_history  gym
world_history  world_history

2) Join list of students that match the source and destination

select c1.class src_class, c2.class dst_class, count(v.student_id) overlap
from (select distinct class from classes) c1
join (select distinct class from classes) c2
left join classes v on
(
    v.class = c1.class
    and v.student_id in (select student_id from classes
                         where class = c2.class)
)
group by src_class, dst_class
order by src_class, dst_class

The distinct values (step 1) allow us to get all classes, even if they are no links (and put 0 instead).

Result :

src_class      dst_class      overlap
-------------------------------------
algebra        algebra           7
algebra        gym               2
algebra        world_history     1
gym            algebra           2
gym            gym               5
gym            world_history     2
world_history  algebra           1
world_history  gym               2
world_history  world_history     6

3 - Make a different calcul if classes are equals

select c1.class src_class, c2.class dst_class, count(v.student_id) overlap
from (select distinct class from classes) c1
join (select distinct class from classes) c2
left join classes v on
(
    v.class = c1.class and
    (
        -- When classes are equals
        -- Students presents only in that class
        (c1.class = c2.class
         and 1 = (select count(*) from classes
                  where student_id = v.student_id))
    or
        -- When classes are differents
        -- Students present in both classes
        (c1.class != c2.class
         and v.student_id in (select student_id from classes
                              where class = c2.class))
    )
)
group by src_class, dst_class
order by src_class, dst_class

Result :

src_class      dst_class      overlap
-------------------------------------
algebra        algebra           5
algebra        gym               2
algebra        world_history     1
gym            algebra           2
gym            gym               2
gym            world_history     2
world_history  algebra           1
world_history  gym               2
world_history  world_history     4

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1270431

Just use a self-join and aggregation:

select c1.class, c2.class, count(*)
from classes c1 join
     classes c2
     on c1.student_id = c2.student_id
group by c1.class, c2.class;

This doesn't produce it in quite the same format.

Upvotes: 1

Related Questions