Doug Robinson
Doug Robinson

Reputation: 59

Rows lost during merge in R

I'm working with df1 that contains 180K rows and 27 factors. I used 3 factors in a merge with df2 (43K obs, 10 factors, 3 factors match with df1). In df1, the rows are observations repeated over months for several years for TrapLines and TrapNum. There are 3393 unique levels associated with TrapLines & TrapNum. In df2, Shade varies by month for TrapLine and Trapnum: for each TrapLine/Trapnum entry, there are 12 values of Shade. In df2, there are 3268 unique levels associated with Trapline/Trapnum. I want to merge (if that's the correct process) df1 and df2 such that the monthly Shade data is associated with the correct Trapline/TrapNum in df1.

df1:

TrapLine    TrapNum Date        Month_num
A24         D101    1-1-2018       1
A24         D101    2-13-2017      2
A24         D102    3-30-2017      3
A24         D102    4-10-2019      4
A24         D103    4-10-2019      4
A24         D103    5-15-2016      6

etc.

df2:

TrapLine    TrapNum Shade      Month    Month_num
A24         D101    217         Jan      1
A24         D101    201         Feb      2
A24         D101    172         Mar      3
A24         D102    145         Mar      3
A24         D102    97          Apr      4
A24         D102    59          May      5

etc.

I used the following code:

df3 <- merge (df1, df2, by c("TrapLine", "TrapNum", "Month_num"))

However, in the process, I lose 40K observations from df1! I'm at a loss to explain what happened: I would've thought that if there weren't matches between TrapLine/TrapNum that the Shade values would've been set to NA or the like, not that observations would be lost altogether. Can someone please help me understand what's happened and what process could be used such that I retain all of the observations (I'll worry about eliminating observations without Shade later!). Thank you for your time and attention. Cheers, Doug

Upvotes: 1

Views: 842

Answers (1)

Daniel R
Daniel R

Reputation: 2042

Use the argument all of the merge function. By default, all=FALSE, so it does a "inner join", and drops all unmatched observations. If you set merge(..., all=TRUE) you have a "outer join" and keep all observations. See https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/merge

Upvotes: 1

Related Questions