Rachie
Rachie

Reputation: 443

Finding collaborations in MySQL

I have a database of people and projects. How can I find the names of people who collaborated with a given person, and on how many projects?

For example, I want to find the collaborators of Jimmy from the database:

+----------+--------+
| project  | person |
+----------+--------+
| datamax  | Jimmy  |
| datamax  | Ashley |
| datamax  | Martin |
| cocoplus | Jimmy  |
| cocoplus | Ashley |
| glassbox | Jimmy  |
| glassbox | Martin |
| powerbin | Jimmy  |
| powerbin | Ashley |
+----------+--------+

The result would look something like this:

Jimmy's collaborations:
+--------+----------------+
| person | collaborations |
+--------+----------------+
| Ashley | 3              |
| Martin | 2              |
+--------+----------------+

Upvotes: 0

Views: 231

Answers (1)

Ruslan Osmanov
Ruslan Osmanov

Reputation: 21502

Join the table with itself, group by the person field:

SELECT u2.person, COUNT(u1.project) AS collaborations
FROM users u1
JOIN users u2 ON u2.project = u1.project
WHERE u1.person != u2.person AND u1.person = 'Jimmy'
GROUP BY u2.person;

The query selects the projects in which Jimmy participated from u1. The rows from u2 are filtered by the rows from u1. Duplicate entries, where the users from both tables match, are filtered with WHERE clause. Finally, the result set is grouped by person, and the COUNT function calculates the number of rows per group.

Performance

Note, an index for person and project columns (or two separate indexes) will significantly improve performance of the query above. Specific index configuration depends on the table structure. Although, I think the following is quite enough for a table with two varchar fields for person and project, for instance:

ALTER TABLE users ADD INDEX `project` (`project`(10));
ALTER TABLE users ADD INDEX `person` (`person`(10));

Normalization

However, I would rather store persons and projects in separate tables with their numeric IDs. A third table could play the role of connector: person_id - project_id. In other words, I recommend normalization. With normalized tables, you will not need to build bloated indexes for the text fields.

Normalized tables may look as follows:

CREATE TABLE users (
  id int unsigned NOT NULL AUTO_INCREMENT,
  name varchar(200) NOT NULL DEFAULT '',
  PRIMARY KEY(`id`),
  -- This index is needed, if you want to fetch users by names
  INDEX name (name(8))
);
CREATE TABLE projects (
  id int unsigned NOT NULL AUTO_INCREMENT,
  name varchar(100) NOT NULL DEFAULT '',
  PRIMARY KEY(`id`)
);
CREATE TABLE collaborations (
  project_id int unsigned NOT NULL DEFAULT 0,
  user_id int unsigned NOT NULL DEFAULT 0,
  PRIMARY KEY(`project_id`, `user_id`)
);

The query for the normalized structures will look a little bit more complex:

-- In practice, the user ID is retrieved from the calling process
-- (such as POST/GET HTTP requests, for instance).
SET @user_id := (SELECT id FROM users WHERE name LIKE 'Jimmy');

SELECT u.name person, COUNT(p.id) collaborations
FROM collaborations c
JOIN collaborations c2 USING(project_id)
JOIN users u ON u.id = c2.user_id
JOIN projects p ON p.id = c2.project_id
WHERE c.user_id = @user_id AND c.user_id != c2.user_id
GROUP BY c2.user_id;

But it will be fast, and the space required for the indexes will be significantly smaller, especially for large data sets.

Original answer

To fetch the total number of projects for each person, use COUNT function with GROUP BY clause:

SELECT person, COUNT(*) AS collaborations
FROM users
GROUP BY person;

Upvotes: 2

Related Questions