Reputation: 24092
I have the following database:
paperid | authorid | name
---------+----------+---------------
1889374 | 897449 | D. N. Page
1889374 | 1795881 | C. N. Pope
1889374 | 1952069 | S. W. Hawking
I would like to create a table having following columns:
The result should look like this:
paperid | author | coauthors
---------+---------------+---------------------------
1889374 | D. N. Page | C. N. Pope S. W. Hawking
1889374 | C. N. Pope | D. N. Page S. W. Hawking
1889374 | S. W. Hawking | D. N. Page C. N. Pope
and this is achieved with following queries:
SELECT foo.paperid, npa.name as author, foo.coauthors
INTO npatest
FROM newpaperauthor npa
CROSS JOIN (
SELECT paperid, string_agg(name, ' ') as coauthors
FROM newpaperauthor
GROUP BY paperid
ORDER BY paperid) foo;
UPDATE npatest SET coauthors = regexp_replace(coauthors, author, '');
SELECT * FROM npatest;
The problem arise when there's more paperid
s in the database like to :
paperid | authorid | name | affiliation
---------+----------+------------------+------------------------
1889373 | 122817 | Kazuhiro Hongo |
1889373 | 1091191 | Hiroshi NAKAGAWA |
1889373 | 1874415 | Hiroshi Nakagawa | University of Oklahoma
1889373 | 2149773 | Han Soo Chang |
1889374 | 897449 | D. N. Page |
1889374 | 1795881 | C. N. Pope |
1889374 | 1952069 | S. W. Hawking |
Then I will get a cartesian product of them like:
paperid | author | coauthors
---------+------------------+----------------------------------------------------------------
1889373 | Kazuhiro Hongo | Hiroshi NAKAGAWA Hiroshi Nakagawa Han Soo Chang
1889374 | Kazuhiro Hongo | D. N. Page C. N. Pope S. W. Hawking
1889373 | Hiroshi NAKAGAWA | Kazuhiro Hongo Hiroshi Nakagawa Han Soo Chang
1889374 | Hiroshi NAKAGAWA | D. N. Page C. N. Pope S. W. Hawking
1889373 | Hiroshi Nakagawa | Kazuhiro Hongo Hiroshi NAKAGAWA Han Soo Chang
1889374 | Hiroshi Nakagawa | D. N. Page C. N. Pope S. W. Hawking
1889373 | Han Soo Chang | Kazuhiro Hongo Hiroshi NAKAGAWA Hiroshi Nakagawa
1889374 | Han Soo Chang | D. N. Page C. N. Pope S. W. Hawking
1889373 | D. N. Page | Kazuhiro Hongo Hiroshi NAKAGAWA Hiroshi Nakagawa Han Soo Chang
1889374 | D. N. Page | C. N. Pope S. W. Hawking
1889373 | C. N. Pope | Kazuhiro Hongo Hiroshi NAKAGAWA Hiroshi Nakagawa Han Soo Chang
1889374 | C. N. Pope | D. N. Page S. W. Hawking
1889373 | S. W. Hawking | Kazuhiro Hongo Hiroshi NAKAGAWA Hiroshi Nakagawa Han Soo Chang
1889374 | S. W. Hawking | D. N. Page C. N. Pope
How to get rid of that cartesian product there ?
Upvotes: 3
Views: 2421
Reputation: 656706
This can be surprisingly simple with array_agg()
as window aggregate function combined with array_remove()
(introduced with pg 9.3):
CREATE TABLE npatest AS
SELECT paperid, name AS author
, array_to_string(array_remove(array_agg(name) OVER (PARTITION BY paperid), name), ', ') AS coauthors
FROM newpaperauthor n;
If author names are not unique, there are complications.
Then again, if author names are not unique, your whole operation is flawed.
Using array_agg()
and array_remove()
instead of string_agg()
and regexp_replace()
, because the latter would fail easily for similar names like 'Jon Fox' and 'Jon Foxy', and also be messy with delimiters.
array_to_string()
transforms the array to a string. I used ', '
as separator, which seems more sensible to me than just a space.
The use of SELECT INTO
is discouraged. Use the superior CREATE TABLE AS
instead. Per documentation:
CREATE TABLE AS
is the recommended syntax, since this form ofSELECT INTO
is not available in ECPG or PL/pgSQL, because they interpret theINTO
clause differently. Furthermore,CREATE TABLE AS
offers a superset of the functionality provided bySELECT INTO
.
Upvotes: 2
Reputation: 4503
@GordonLinoff 's query can be simplified a bit by suppressing the 1st author in the aggregate:
SELECT DISTINCT
p0.paperid , p0.authorid , p0.name as name1
, string_agg(p1.name, ', ' ) AS others
FROM papers p0
JOIN papers p1 ON p1.paperid = p0.paperid AND p1.authorid <> p0.authorid
GROUP BY p0.paperid, p0.authorid, p0.name
ORDER BY p0.paperid, p0.authorid
;
Upvotes: 0
Reputation: 1269753
Here is a way to approach this problem:
Generate the list of all co-authors as a subquery. Generate the list of all authors. Then join these together and do the string manipulation to get what you want.
The authors is easy:
select paperid, npa.name as author
from newpaperauthor npa;
The co-authors is easy:
select paperid, string_agg(npa.name, ' ') as coauthors
from newpaperauthor npa
group by paperid;
The combination requires some list substitution:
select a.paperid, a.author,
replace(replace(coauthors, author, ''), ' ', ' ') as coauthors
from (select paperid, npa.name as author
from newpaperauthor npa
) a join
(select paperid, string_agg(npa.name, ' ') as coauthors
from newpaperauthor npa
group by paperid
) ca
on a.paperid = ca.paperid;
Upvotes: 3