Rafael Sierra

Reputation: 117

Merge function generates duplicates

This question is just to understand why this would happen.

I'm merging two databases:

bot.rep.geo <- merge(x = bot.rep, y = geo.2016, by = "cod.geo", all.x = TRUE)

The original data frames have the following dimensions: bot.rep has 1634451 rows and geo.2016 has 1393.

After merging with all.x = TRUE, the new data frame has 1727681 rows instead of the same number as bot.rep.

Why does this happen?

After a quick review, I realised it was creating some duplicates, but I don't understand why, or whether I'm doing something wrong when using the merge function.

Upvotes: 2

Views: 390

Answers (2)

zx8754

Reputation: 56149

This happens because of a one-to-many relationship: some rows in x match multiple rows in y.

See the example below, where the cod.geo value 1 in bot.rep has 2 matches in the geo.2016 dataset. Hence, we get 2 rows for that one id. Also, notice that unmatched ids are kept with NA values because of the all.x = TRUE argument.

Now, you need to decide which of the rows for cod.geo value 1 is the duplicate.

# dummy data: cod.geo 1 appears twice in geo.2016
bot.rep <- data.frame(cod.geo = 1:4)
geo.2016 <- data.frame(cod.geo = c(1,1,3,5,6), z = 1:5)

bot.rep.geo <- merge(x = bot.rep, y = geo.2016,
                     by = "cod.geo", all.x = TRUE)

bot.rep.geo
#   cod.geo  z
# 1       1  1
# 2       1  2
# 3       2 NA
# 4       3  3
# 5       4 NA

You will find more info on different types of merge functions here.
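
One way to act on that decision, as a minimal sketch on the dummy data: if the repeated cod.geo rows in geo.2016 are redundant and keeping the first one is acceptable (that is an assumption; your real data may need a different rule), drop them before merging so that each row of bot.rep matches at most once. The name geo.unique is only illustrative.

```r
# dummy data: cod.geo 1 appears twice in geo.2016
bot.rep <- data.frame(cod.geo = 1:4)
geo.2016 <- data.frame(cod.geo = c(1, 1, 3, 5, 6), z = 1:5)

# keep only the first row for each cod.geo
# (assumption: the later rows for the same key are redundant)
geo.unique <- geo.2016[!duplicated(geo.2016$cod.geo), ]

bot.rep.geo <- merge(x = bot.rep, y = geo.unique,
                     by = "cod.geo", all.x = TRUE)

# the merged result now has as many rows as bot.rep
nrow(bot.rep.geo) == nrow(bot.rep)
```

If the repeated rows carry different information, aggregating them or picking a row by some other column is probably a better choice than keeping the first.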

Upvotes: 1

user1923975

Reputation: 1389

There may be lines in the geo.2016 table where the cod.geo value appears twice or more.

If you have a cod.geo value of "X" in your bot.rep data and 2 lines that contain "X" in the geo.2016 data, the merge will duplicate the line from bot.rep and join it to each of the 2 lines from geo.2016.
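
To check whether this is the case, something along these lines should work. It is a sketch on made-up data standing in for the real geo.2016; key.counts and dup.keys are just illustrative names.

```r
# made-up data standing in for the real geo.2016 table
geo.2016 <- data.frame(cod.geo = c(1, 1, 3, 5, 6), z = 1:5)

# 0 means every cod.geo is unique; otherwise this is the index of the first repeat
anyDuplicated(geo.2016$cod.geo)

# count rows per key and keep only the keys that occur more than once
key.counts <- table(geo.2016$cod.geo)
dup.keys <- names(key.counts[key.counts > 1])
dup.keys

# show the rows that share a key
geo.2016[geo.2016$cod.geo %in% dup.keys, ]
```

Each key listed in dup.keys adds extra rows to the merge for every matching line in bot.rep, which is consistent with the result growing beyond the size of bot.rep.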

Upvotes: 1
