Reputation: 583
I have two data sets author_data and paper_author
author_data:
author_id author_name author_affiliation
25 William H. Nailon
37 P. B. Littlewood Cavendish Laboratory|Cambridge University
44 A. Kuroiwa Department of Molecular Biology
paper_author:
paper_id author_id author_name author_affiliation
1 521630 Ayman Kaheel Cairo Microsoft Innovation Lab
1 972575 Mahmoud Refaat Cairo Microsoft Innovation Lab
I have run the following query in R
author_data[which(author_data$author_id %in% paper_author$author_id &
author_data$author_name %in% paper_author$author_name &
author_data$author_affiliation %in% paper_author$author_affiliation), ]
That is, I want to find out the matches between author_data and paper_author for which the three columns author_id
, author_name
and author_affiliation
match.
I have written a query to get this result in sql but I am not getting it right.The query which I have tried is
statement <- "select
author_data.author_id,
author_data.author_name,
author_data.author_affiliation
FROM author_data
INNER JOIN paper_author
ON author_data.author_id = paper_author.author_id
AND author_data.author_name = paper_author.author_name
AND author_data.author_affiliation = paper_author.author_affiliation"
through this query I am getting more rows than the rows in author_data and the query should fetch data which first of all would be subset of author_data. I am not able to figure out what is wrong as I am naive at sql.
What is wrong with this query?
Thanks
Upvotes: 1
Views: 103
Reputation: 22623
There is a difference between which
in R and join
in SQL. While which
will effectively subset given data frame, join
will return all rows where join
condition is met. I am almost sure, that in your case you have multiple occurences of combination author_id, author_name, author_affiliation
in paper_author
. As a result, rows in author_data
are multiplied by rows in paper_author
.
Your query was almost correct, you need to add distinct
or group by
or use exists
:
Distinct:
select
distinct
author_data.author_id,
author_data.author_name,
author_data.author_affiliation
from
author_data
INNER JOIN paper_author
ON author_data.author_id = paper_author.author_id
AND author_data.author_name = paper_author.author_name
AND author_data.author_affiliation = paper_author.author_affiliation
Group by:
select
author_data.author_id,
author_data.author_name,
author_data.author_affiliation
from
author_data
INNER JOIN paper_author
ON author_data.author_id = paper_author.author_id
AND author_data.author_name = paper_author.author_name
AND author_data.author_affiliation = paper_author.author_affiliation
group by
author_data.author_id,
author_data.author_name,
author_data.author_affiliation
You can also use exists
:
select
author_data.author_id,
author_data.author_name,
author_data.author_affiliation
from
author_data
where
exists (select 1 from paper_author where
author_data.author_id = paper_author.author_id
AND author_data.author_name = paper_author.author_name
AND author_data.author_affiliation = paper_author.author_affiliation
)
Upvotes: 1
Reputation: 31
Try this.
SELECT author_data.author_id,author_data.author_name,author_data.author_affiliation
FROM author_data, paper_author
WHERE author_data.author_id = paper_author.author_id
AND author_data.author_name=paper_author.author_name
AND author_data.author_affiliation=paper_author.author_affiliation
Upvotes: 0