R - remove duplicate rows (based on 2 columns, regardless of order) in data table

Question

There are some similar questions here on SO, but none seemed to apply to my specific data format, therefore I decided to post my question.

My data table looks like this:

year  field fert_name.x   yield.x fert_name.y   yield.y
 1: 2017 field1       fertA  7.875775       fertB 14.404673
 2: 2017 field1       fertA  7.875775       fertC 10.514350
 3: 2017 field1       fertB 14.404673       fertA  7.875775
 4: 2017 field1       fertB 14.404673       fertC 10.514350
 5: 2017 field1       fertC 10.514350       fertA  7.875775
 6: 2017 field1       fertC 10.514350       fertB 14.404673
 7: 2017 field2       fertA  9.089769       fertB 10.281055
 8: 2017 field2       fertA  9.089769       fertC 14.568333
 9: 2017 field2       fertB 10.281055       fertA  9.089769
10: 2017 field2       fertB 10.281055       fertC 14.568333
11: 2017 field2       fertC 14.568333       fertA  9.089769
12: 2017 field2       fertC 14.568333       fertB 10.281055
13: 2018 field1       fertA 12.883051       fertB  5.455565
14: 2018 field1       fertA 12.883051       fertC  9.566147
15: 2018 field1       fertB  5.455565       fertA 12.883051
16: 2018 field1       fertB  5.455565       fertC  9.566147
17: 2018 field1       fertC  9.566147       fertA 12.883051
18: 2018 field1       fertC  9.566147       fertB  5.455565
19: 2018 field2       fertA 13.830174       fertB 13.924190
20: 2018 field2       fertA 13.830174       fertC  9.533342
21: 2018 field2       fertB 13.924190       fertA 13.830174
22: 2018 field2       fertB 13.924190       fertC  9.533342
23: 2018 field2       fertC  9.533342       fertA 13.830174
24: 2018 field2       fertC  9.533342       fertB 13.924190

and I would like to remove all rows where, for each year and field, fert_name.x and fert_name.y are duplicates. Examples of such cases are rows 1 and 3 (fertA/fertB and fertB/fertA), rows 2 and 5 (fertA/fertC and fertC/fertA), and so on.

What is the best way to do this?

Steps to reproduce the dataset:

library(data.table)
# create some fake data
dat <- data.table(expand.grid(year=2017:2018,
                              field=c("field1","field2"),
                              fert_name=c("fertA","fertB","fertC"),
                              stringsAsFactors=F))
set.seed(123); yld <- runif(12, min=5, max=15)
dat$yield <- yld
(dat.m <- merge(dat, dat, by=c("year","field"), allow.cartesian=T)[fert_name.x != fert_name.y])

r2evans · Accepted Answer

I think there are no rows in your current frame that are retained, since you have a full "perfect" expansion of all pairs. However, let's alter that a bit to make sure we have some data to retain:

dat.m <- dat.m[-c(3,5),]

With that, we can provide a "sorted" selection of columns so that we can deal with different ordering of your fert_name.* columns:

dat.m[, `:=`(
    fert1 = pmin(fert_name.x, fert_name.y),
    fert2 = pmax(fert_name.x, fert_name.y))
  ][, .SD[.N == 1,], by = .(year, field, fert1, fert2)]
#     year  field  fert1  fert2 fert_name.x  yield.x fert_name.y  yield.y
#                           
# 1:  2017 field1  fertA  fertB       fertA 7.875775       fertB 14.40467
# 2:  2017 field1  fertA  fertC       fertA 7.875775       fertC 10.51435

R - remove duplicate rows (based on 2 columns, regardless of order) in data table

Answers (1)

Related Questions