Reputation: 117
This question is just to understand why this would happen.
I'm merging two databases:
bot.rep.geo <- merge(x = bot.rep, y = geo.2016, by = "cod.geo", all.x = TRUE)
The original databases have the following dimensions: bot.rep has 1634451 observations, geo.2016 has 1393.
After merging using all.x = TRUE, the new database has 1727681 rows instead of the same number as bot.rep.
Why does this happen?
After a quick review, I realised it was creating some duplicates, but I don't understand the reason, or whether I'm doing something wrong when using the merge function.
Upvotes: 2
Views: 390
Reputation: 56149
This happens because of a one-to-many relationship: some rows in x match more than one row in y.
See the example below, where the bot.rep cod.geo value of 1 has 2 matches in the geo.2016 dataset. Hence, we get 2 rows for 1 id. Also, notice that ids with no match are kept with NA in the columns coming from y, because of the all.x = TRUE argument.
Now, you need to decide which row is a duplicate for cod.geo value 1.
# dummy data
bot.rep <- data.frame(cod.geo = 1:4)
geo.2016 <- data.frame(cod.geo = c(1,1,3,5,6), z = 1:5)
bot.rep.geo <- merge(x = bot.rep, y = geo.2016,
                     by = "cod.geo", all.x = TRUE)
bot.rep.geo
#   cod.geo  z
# 1       1  1
# 2       1  2
# 3       2 NA
# 4       3  3
# 5       4 NA
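To track down which keys cause the extra rows, something along these lines may help. It is only a sketch on the same dummy data: the names key.counts, dup.keys and geo.unique are made up for illustration, and keeping the first row per key is an assumption, not necessarily the right rule for your data.

```r
# Same dummy data as in the example above
bot.rep <- data.frame(cod.geo = 1:4)
geo.2016 <- data.frame(cod.geo = c(1, 1, 3, 5, 6), z = 1:5)

# Count how often each key occurs in geo.2016
key.counts <- table(geo.2016$cod.geo)

# Keys that occur more than once are the ones that multiply rows in the merge
dup.keys <- names(key.counts)[key.counts > 1]
dup.keys

# Assumption: keep only the first row per key.
# Replace this with whatever rule makes sense for your data.
geo.unique <- geo.2016[!duplicated(geo.2016$cod.geo), ]

bot.rep.geo <- merge(x = bot.rep, y = geo.unique,
                     by = "cod.geo", all.x = TRUE)

# With unique keys in y, the left join keeps the row count of x
nrow(bot.rep.geo) == nrow(bot.rep)
```

If the rows flagged by dup.keys differ in other columns, it is worth inspecting them before dropping anything, since the "duplicate" may carry information you want to keep.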
You will find more info on different types of merge functions here.
Upvotes: 1
Reputation: 1389
There may be lines in the geo.2016 table where the same cod.geo value appears twice or more.
If a cod.geo value of "X" appears in a line of your bot.rep data and 2 lines in the geo.2016 data contain "X", the merge will duplicate the bot.rep line and join it to each of the 2 lines from geo.2016.
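The row count after the merge can be worked out from this rule: each bot.rep line contributes one output line per match in geo.2016, or exactly one line if it has no match (because of all.x = TRUE). Here is a rough sketch on made-up toy data; matches and expected.rows are illustrative names only, and the per-row loop is meant for small examples, not for 1.6 million rows.

```r
# Toy data (hypothetical, only to illustrate the counting)
bot.rep <- data.frame(cod.geo = 1:4)
geo.2016 <- data.frame(cod.geo = c(1, 1, 3, 5, 6), z = 1:5)

# For each row of bot.rep, count the rows in geo.2016 with the same key
matches <- vapply(bot.rep$cod.geo,
                  function(k) sum(geo.2016$cod.geo == k),
                  integer(1))

# Unmatched rows still appear once when all.x = TRUE
expected.rows <- sum(pmax(matches, 1))

merged <- merge(x = bot.rep, y = geo.2016, by = "cod.geo", all.x = TRUE)

# The prediction agrees with the actual merge
expected.rows == nrow(merged)
```

Applied to the question, the difference 1727681 - 1634451 = 93230 would be the number of extra lines produced by keys that appear more than once in geo.2016.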
Upvotes: 1