A. gba
A. gba

Reputation: 15

Combining two dataframes, similar rows, similar columns except one column has different values

I have two dataframes that I want to combine into one. Rows in one dataframe are not always in the other and vice versa. I want to keep all rows and columns from both datasets. Additionally, there is one column ("Tag") that sometimes has different values for the same row of interest ("ID"). My dataset is too big to find them manually. I am having difficulty creating a separate column for when the values in "Tag" are not identical. I've tried test.output1 <-union(test.df1, test.df2,suffix=c(".1",".2")) but get an error due to some of the identical columns having different factor levels. I've also tried variations of: test.output2<-rbind.fill(test.df1, test.df2) but end up with duplicate rows.

Any help is greatly appreciated. Thanks!

Example:

#dataframe 1
test.df1
ID     Year   Location Tag    Length
H1     2013   Site1    272   46
H2     2013   Site2    236   984
H3     2014   Site3    150   68
H4     2014   Site4    698   12
H34    2015   Site1    594   65

#dataframe 3
test.df2
ID     Year   Species Tag    
H1     2013   1       631 
H2     2013   2       236
H3     2014   3       755
H4     2014   4       698
H12    2017   3       135

#What I would like the output to be
test.df.3 
ID     Year   Location  Species Tag.1   Tag.2  Length
H1     2013   Site1     1       272     631    46
H2     2013   Site2     2       236     NA     984
H3     2014   Site3     3       150     755    68
H4     2014   Site4     4       698     NA     12
H12    2017   NA        3       NA      135    NA
H34    2015   Site1     NA      594     NA     65

Upvotes: 0

Views: 694

Answers (2)

r2evans
r2evans

Reputation: 160397

A union or row-join would give you ten rows, I don't think that's what you need. It also won't separate the Tag columns.

I think you need a merge or join operation.

merge(test.df1, test.df2, by = c("ID", "Year"), all=TRUE)
#    ID Year Location Tag.x Length Species Tag.y
# 1  H1 2013    Site1   272     46       1   631
# 2 H12 2017     <NA>    NA     NA       3   135
# 3  H2 2013    Site2   236    984       2   236
# 4  H3 2014    Site3   150     68       3   755
# 5 H34 2015    Site1   594     65      NA    NA
# 6  H4 2014    Site4   698     12       4   698

It's different than you listed in your question, but I think the premise is right.

Edit. Closer to your question, here we update Tag.y to NA where it matches Tag.x.

out <- merge(test.df1, test.df2, by = c("ID", "Year"), all=TRUE)
out <- within(out, { Tag.y[Tag.x == Tag.y] = NA_integer_ })
out
#    ID Year Location Tag.x Length Species Tag.y
# 1  H1 2013    Site1   272     46       1   631
# 2 H12 2017     <NA>    NA     NA       3   135
# 3  H2 2013    Site2   236    984       2    NA
# 4  H3 2014    Site3   150     68       3   755
# 5 H34 2015    Site1   594     65      NA    NA
# 6  H4 2014    Site4   698     12       4    NA

Upvotes: 2

Gallarus
Gallarus

Reputation: 476

Alternatively using dplyr and removing the duplicates so if tag.x == tag.y then keep only tag.x

library(dplyr)    
test.df3 <- full_join(test.df1, test.df2, by = c("ID", "Year")) %>%
  mutate(tag.y = if_else(tag.x == tag.y, NA_integer_, tag.y))

Upvotes: 0

Related Questions