user3171906
user3171906

Reputation: 583

sql query for finding multiple duplicate pairs in columns

I have a table as below:

paper_id  author_id     author_name         author_affiliation
    1     521630         Ayman Kaheel        Cairo Microsoft Innovation Lab
    1     972575       Mahmoud Refaat       Cairo Microsoft Innovation Lab
    3    1528710     Ahmed Abdul-hamid      Harvard

Now I am finding multiple pairs of author_id, author_name and author_affiliation occuring together. For example:

author_id     author_name     author_affiliation  count
   1          Masuo Fukui               <NA>       4
   4          Yasusada Yamada           <NA>       8

I am using the following query:

statement<-"select author_id,author_name,author_affiliation,count(*)
        from paper_author 
        GROUP BY author_id,author_name,author_affiliation
        HAVING (COUNT(*)>1)" 

Now I want to know how many author_ids are present in this. I am doing this:

statement<-"select distinct author_id 
    from paper_author 
     where author_id in (
        select author_id,author_name,author_affiliation,count(*)
        from paper_author 
        GROUP BY author_id,author_name,author_affiliation
        HAVING (COUNT(*)>1)
    )" 

I am not able to get the desired result.

Also, how can I get the number of paper ids in the above result?

Thanks.

Upvotes: 0

Views: 1242

Answers (3)

Thorsten Kettner
Thorsten Kettner

Reputation: 94914

Here is your query slightly re-written. You don't need an IN clause. You can select directly from your result set.

select distinct author_id 
from 
(
  select author_id
  from paper_author 
  group by author_id,author_name,author_affiliation
  having count(*) > 1
);

Upvotes: 0

Patrick
Patrick

Reputation: 32224

If you just want to know how many authors have more than one paper, use this query:

SELECT COUNT(*) 
FROM (SELECT author_id, author_affiliation, COUNT(*)
      FROM paper_author 
      GROUP BY author_id, author_affiliation
      HAVING COUNT(*) > 1);

This assumes that author_id is a unique identifier for author_name. If the id selects for the author_name, author_affiliation combination (i.e. an author producing papers for different institutions has multiple id's, one for each affiliation) then you can also strike author_affiliation from the sub-query.

Upvotes: 0

PeterRing
PeterRing

Reputation: 1797

This will do the job, i think:

statement<-"select distinct author_id 
    from paper_author 
     where author_id in (
        select author_id
        from paper_author 
        GROUP BY author_id,author_name,author_affiliation
        HAVING (COUNT(*)>1)
    )" 

Upvotes: 1

Related Questions