Patryk
Patryk

Reputation: 24092

How to eliminate cartesian product with joins while using subquery?

I have the following database:

 paperid | authorid | name
---------+----------+---------------
 1889374 |   897449 | D. N. Page
 1889374 |  1795881 | C. N. Pope
 1889374 |  1952069 | S. W. Hawking

I would like to create a table having following columns:

The result should look like this:

 paperid |    author     |          coauthors          
---------+---------------+---------------------------
 1889374 | D. N. Page    |  C. N. Pope S. W. Hawking
 1889374 | C. N. Pope    | D. N. Page  S. W. Hawking
 1889374 | S. W. Hawking | D. N. Page C. N. Pope 

and this is achieved with following queries:

SELECT  foo.paperid, npa.name as author, foo.coauthors
INTO npatest
FROM newpaperauthor npa
CROSS JOIN (
   SELECT paperid, string_agg(name, ' ') as coauthors
   FROM newpaperauthor
   GROUP BY paperid
   ORDER BY paperid) foo;
UPDATE npatest SET coauthors = regexp_replace(coauthors, author, '');
SELECT * FROM npatest;

The problem arise when there's more paperids in the database like to :

 paperid | authorid |       name       |      affiliation       
---------+----------+------------------+------------------------
 1889373 |   122817 | Kazuhiro Hongo   | 
 1889373 |  1091191 | Hiroshi NAKAGAWA | 
 1889373 |  1874415 | Hiroshi Nakagawa | University of Oklahoma
 1889373 |  2149773 | Han Soo Chang    | 
 1889374 |   897449 | D. N. Page       | 
 1889374 |  1795881 | C. N. Pope       | 
 1889374 |  1952069 | S. W. Hawking    | 

Then I will get a cartesian product of them like:

 paperid |      author      |                           coauthors                            
---------+------------------+----------------------------------------------------------------
 1889373 | Kazuhiro Hongo   |  Hiroshi NAKAGAWA Hiroshi Nakagawa Han Soo Chang
 1889374 | Kazuhiro Hongo   | D. N. Page C. N. Pope S. W. Hawking
 1889373 | Hiroshi NAKAGAWA | Kazuhiro Hongo  Hiroshi Nakagawa Han Soo Chang
 1889374 | Hiroshi NAKAGAWA | D. N. Page C. N. Pope S. W. Hawking
 1889373 | Hiroshi Nakagawa | Kazuhiro Hongo Hiroshi NAKAGAWA  Han Soo Chang
 1889374 | Hiroshi Nakagawa | D. N. Page C. N. Pope S. W. Hawking
 1889373 | Han Soo Chang    | Kazuhiro Hongo Hiroshi NAKAGAWA Hiroshi Nakagawa 
 1889374 | Han Soo Chang    | D. N. Page C. N. Pope S. W. Hawking
 1889373 | D. N. Page       | Kazuhiro Hongo Hiroshi NAKAGAWA Hiroshi Nakagawa Han Soo Chang
 1889374 | D. N. Page       |  C. N. Pope S. W. Hawking
 1889373 | C. N. Pope       | Kazuhiro Hongo Hiroshi NAKAGAWA Hiroshi Nakagawa Han Soo Chang
 1889374 | C. N. Pope       | D. N. Page  S. W. Hawking
 1889373 | S. W. Hawking    | Kazuhiro Hongo Hiroshi NAKAGAWA Hiroshi Nakagawa Han Soo Chang
 1889374 | S. W. Hawking    | D. N. Page C. N. Pope 

How to get rid of that cartesian product there ?

Upvotes: 3

Views: 2421

Answers (3)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 656706

This can be surprisingly simple with array_agg() as window aggregate function combined with array_remove() (introduced with pg 9.3):

CREATE TABLE npatest AS
SELECT paperid, name AS author
     , array_to_string(array_remove(array_agg(name) OVER (PARTITION BY paperid), name), ', ') AS coauthors
FROM   newpaperauthor n;

If author names are not unique, there are complications.
Then again, if author names are not unique, your whole operation is flawed.

Using array_agg() and array_remove() instead of string_agg() and regexp_replace(), because the latter would fail easily for similar names like 'Jon Fox' and 'Jon Foxy', and also be messy with delimiters.

array_to_string() transforms the array to a string. I used ', ' as separator, which seems more sensible to me than just a space.

The use of SELECT INTO is discouraged. Use the superior CREATE TABLE AS instead. Per documentation:

CREATE TABLE AS is the recommended syntax, since this form of SELECT INTO is not available in ECPG or PL/pgSQL, because they interpret the INTO clause differently. Furthermore, CREATE TABLE AS offers a superset of the functionality provided by SELECT INTO.

SQL Fiddle.

Upvotes: 2

joop
joop

Reputation: 4503

@GordonLinoff 's query can be simplified a bit by suppressing the 1st author in the aggregate:

SELECT DISTINCT
        p0.paperid , p0.authorid , p0.name as name1
        , string_agg(p1.name, ', ' ) AS others
FROM papers p0
JOIN papers p1 ON p1.paperid = p0.paperid AND p1.authorid <> p0.authorid
GROUP BY p0.paperid, p0.authorid, p0.name
ORDER BY p0.paperid, p0.authorid
        ;

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1269753

Here is a way to approach this problem:

Generate the list of all co-authors as a subquery. Generate the list of all authors. Then join these together and do the string manipulation to get what you want.


The authors is easy:

select paperid, npa.name as author
from newpaperauthor npa;

The co-authors is easy:

select paperid, string_agg(npa.name, ' ') as coauthors
from newpaperauthor npa
group by paperid;

The combination requires some list substitution:

select a.paperid, a.author,
       replace(replace(coauthors, author, ''), '  ', ' ') as coauthors
from (select paperid, npa.name as author
      from newpaperauthor npa
     ) a join
     (select paperid, string_agg(npa.name, ' ') as coauthors
      from newpaperauthor npa
      group by paperid
     ) ca
     on a.paperid = ca.paperid;

Upvotes: 3

Related Questions