Gabriel
Gabriel

Reputation: 443

R data table merge vs Stata merge

I am comparing R data.table merge with Stata's merge command.

Say we have two datasets, X and Y. Both have a common key id.

Stata has 1:1 merge which means datasets will be merged one to one on a unique id. If we are using dataset X in Stata and wish to merge with Y, "master" is X and "using" is Y.

merge 1:1 id using Y, options

Stata further gives you a choice of as they called it "matched result". Basically it is a choice of 3 values.

Numeric Code    Description
           1    observation appeared in X only
           2    observation appeared in Y only
           3    observation appeared in both only

I believe that in R data.table, a 1:1 merge it is:

For "merged result" numeric code 1, you do LEFT JOIN. Then any result not appearing in X will be given NA. So we need to do

Y[X, on=.(id)] or merge(X,Y, all.x=TRUE, on="id")

For "merged result" numeric code 2, you do RIGHT JOIN. Then any result not appearing in Y will be given NA. So we need to do

X[Y, on=.(id)] or merge(X,Y, all.y=TRUE, on="id")

For "merged result" numeric code 3, you do INNER JOIN. It only displays data that exist in both X and Y.

X[Y, nomatch=0, on=.(id)] or merge(X,Y, all=FALSE, on="id")

Is this correct? I am comfortable with Stata choice 3 which is basically INNER JOIN. But I am confused with choices 1 and 2. If you further do na.omit(), it will revert to choice 3. Is that correct?

And what can I do in R for 1:m or m:1 one to many and vice versa?

FYI for Stata manual:

https://www.stata.com/manuals/dmerge.pdf 

For data.table merge:

https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html

Upvotes: 4

Views: 3533

Answers (1)

csgroen
csgroen

Reputation: 2541

I'm not very familiar with Stata, but I think I can explain the merge (from base, which data.table adapted)

Let's say you have these two data frames:

A <- data.frame(id = c(1,2,4,5),
                obs1 = c(100, 120, 500, 70))

B <- data.frame(id = c(1,3,4,5),
                obs2 = c("a", "c", "t", "r"))

You can join them in these ways:

Inner join

merge(A, B)

You'll have only the intersects of A and B, e.g. ids 1, 4 and 5.

Left join

merge(A, B, all.x = TRUE)

You'll have the intersect info, plus whatever is present in A but not B. e.g. 1, 2, 4 and 5

Right join

merge(A, B, all.y = TRUE)

Here, you'll have the intersect info, plus whatever is in B but not A. e.g. ids 1, 3, 4 and 5

Full join

merge(A, B, all.x = TRUE, all.y = TRUE)

Here, you'll have the union of ids present in A and B. e.g. all ids, 1-5.


Alternatively, take a look at dplyr's join functions. They have inner_join, left_join, right_join, etc.

Upvotes: 2

Related Questions