Reputation: 2144
I have two data frames, the first contains 9994 rows and the second contains 60431 rows. I want to merge the two data frames such that the merged data frame contains combined columns of both data frames but only contains 9994 rows.
However, I get more than 9994 rows upon merge. How can I make sure this does not happen?
df1 = readRDS('data1.RDS')
nrow(df1)
# [1] 9994
df2 = readRDS('data2.RDS')
nrow(df2)
# [1] 60431
df = merge(df1,df2,by=c("col1","col2"))
nrow(df)
# [1] 10057
df = merge(df1,df2,by=c("col1","col2"),all.x=TRUE)
nrow(df)
# [1] 10057
nrow(na.omit(df))
# [1] 10057
EDIT : Following akrun's comment. Yes, there were duplicates in the second data frame
nrow(unique(df2[,c("col1","col2")]))
# [1] 60263
nrow(df2)
# [1] 60431
How can I take only one row from a data frame if there are multiple for the same {col1,col2} combination. When I merge, I would like to have only 9994 rows.
Upvotes: 7
Views: 8086
Reputation: 2535
This should work, be sure to sort df2
first so you select the right rows.
df = merge(
df1,
df2[!duplicated(df2[, c("col1","col2")]), ],
by=c("col1","col2"),
all.x=TRUE
)
What happens here: I merge the two data frames by the columns we want to merge by, but I first select only the first occurrence of any combination of col1
and col2
from the second data.frame
df2
.
duplicated
checks if lines are duplicated if called with a data.frame
. I select col1
and col2
from df2
, so duplicated
returns TRUE
for rows with the same col1
and col2
but differences in other cols. Then I select only the rows which are not duplicated.
(Read the [
-expressions carefully, and check the function calls from the inside out, to get the intermediate results)
edit: added explanation as suggested in comments
Upvotes: 3