Reputation: 443
I am comparing R data.table merge with Stata's merge command.
Say we have two datasets, X and Y. Both have a common key id
.
Stata has 1:1 merge
which means datasets will be merge
d one to one on a unique id. If we are using dataset X in Stata and wish to merge with Y, "master" is X and "using" is Y.
merge 1:1 id using Y, options
Stata further gives you a choice of as they called it "matched result". Basically it is a choice of 3 values.
Numeric Code Description
1 observation appeared in X only
2 observation appeared in Y only
3 observation appeared in both only
I believe that in R data.table, a 1:1 merge it is:
For "merged result" numeric code 1, you do LEFT JOIN. Then any result not appearing in X will be given NA. So we need to do
Y[X, on=.(id)] or merge(X,Y, all.x=TRUE, on="id")
For "merged result" numeric code 2, you do RIGHT JOIN. Then any result not appearing in Y will be given NA. So we need to do
X[Y, on=.(id)] or merge(X,Y, all.y=TRUE, on="id")
For "merged result" numeric code 3, you do INNER JOIN. It only displays data that exist in both X and Y.
X[Y, nomatch=0, on=.(id)] or merge(X,Y, all=FALSE, on="id")
Is this correct? I am comfortable with Stata choice 3 which is basically INNER JOIN. But I am confused with choices 1 and 2. If you further do na.omit(), it will revert to choice 3. Is that correct?
And what can I do in R for 1:m or m:1 one to many and vice versa?
FYI for Stata manual:
https://www.stata.com/manuals/dmerge.pdf
For data.table merge:
https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html
Upvotes: 4
Views: 3533
Reputation: 2541
I'm not very familiar with Stata, but I think I can explain the merge (from base, which data.table adapted)
Let's say you have these two data frames:
A <- data.frame(id = c(1,2,4,5),
obs1 = c(100, 120, 500, 70))
B <- data.frame(id = c(1,3,4,5),
obs2 = c("a", "c", "t", "r"))
You can join them in these ways:
Inner join
merge(A, B)
You'll have only the intersects of A and B, e.g. ids 1, 4 and 5.
Left join
merge(A, B, all.x = TRUE)
You'll have the intersect info, plus whatever is present in A but not B. e.g. 1, 2, 4 and 5
Right join
merge(A, B, all.y = TRUE)
Here, you'll have the intersect info, plus whatever is in B but not A. e.g. ids 1, 3, 4 and 5
Full join
merge(A, B, all.x = TRUE, all.y = TRUE)
Here, you'll have the union of ids present in A and B. e.g. all ids, 1-5.
Alternatively, take a look at dplyr's join functions. They have inner_join
, left_join
, right_join
, etc.
Upvotes: 2