user3171906
user3171906

Reputation: 583

sql equivalent of R query

I have two data sets author_data and paper_author

author_data:

author_id       author_name          author_affiliation
 25         William H. Nailon                                                                    
 37         P. B. Littlewood        Cavendish Laboratory|Cambridge University
 44         A. Kuroiwa               Department of Molecular Biology 

paper_author:

paper_id     author_id      author_name      author_affiliation
  1          521630         Ayman Kaheel     Cairo Microsoft Innovation Lab
  1          972575       Mahmoud Refaat     Cairo Microsoft Innovation Lab

I have run the following query in R

author_data[which(author_data$author_id %in% paper_author$author_id &
                  author_data$author_name %in% paper_author$author_name & 
                  author_data$author_affiliation %in% paper_author$author_affiliation), ]

That is, I want to find out the matches between author_data and paper_author for which the three columns author_id, author_name and author_affiliation match.

I have written a query to get this result in sql but I am not getting it right.The query which I have tried is

statement <- "select
              author_data.author_id,
              author_data.author_name,
              author_data.author_affiliation
        FROM author_data
        INNER JOIN paper_author
          ON author_data.author_id = paper_author.author_id
            AND author_data.author_name = paper_author.author_name
            AND author_data.author_affiliation = paper_author.author_affiliation"

through this query I am getting more rows than the rows in author_data and the query should fetch data which first of all would be subset of author_data. I am not able to figure out what is wrong as I am naive at sql.

What is wrong with this query?

Thanks

Upvotes: 1

Views: 103

Answers (2)

Tomas Greif
Tomas Greif

Reputation: 22623

There is a difference between which in R and join in SQL. While which will effectively subset given data frame, join will return all rows where join condition is met. I am almost sure, that in your case you have multiple occurences of combination author_id, author_name, author_affiliation in paper_author. As a result, rows in author_data are multiplied by rows in paper_author.

Your query was almost correct, you need to add distinct or group by or use exists:

Distinct:

select
   distinct
   author_data.author_id,
   author_data.author_name,
   author_data.author_affiliation
from
   author_data
   INNER JOIN paper_author
          ON author_data.author_id = paper_author.author_id
            AND author_data.author_name = paper_author.author_name
            AND author_data.author_affiliation = paper_author.author_affiliation

Group by:

select
   author_data.author_id,
   author_data.author_name,
   author_data.author_affiliation
from
   author_data
   INNER JOIN paper_author
          ON author_data.author_id = paper_author.author_id
            AND author_data.author_name = paper_author.author_name
            AND author_data.author_affiliation = paper_author.author_affiliation
group by
   author_data.author_id,
   author_data.author_name,
   author_data.author_affiliation

You can also use exists:

select
   author_data.author_id,
   author_data.author_name,
   author_data.author_affiliation
from
   author_data
where
   exists (select 1 from paper_author where
       author_data.author_id = paper_author.author_id
       AND author_data.author_name = paper_author.author_name
       AND author_data.author_affiliation = paper_author.author_affiliation
       )

Upvotes: 1

Pascal Peregrina
Pascal Peregrina

Reputation: 31

Try this.

SELECT author_data.author_id,author_data.author_name,author_data.author_affiliation
FROM author_data, paper_author
WHERE author_data.author_id = paper_author.author_id 
AND author_data.author_name=paper_author.author_name 
AND author_data.author_affiliation=paper_author.author_affiliation

Upvotes: 0

Related Questions