Reputation: 2091
I am trying to simplify an analysis that used two SQL
queries, down to one. In doing so, I joined biomass data to the size class data within the single SQL
query, which creates duplicates. This is because biomass is already a sum, and is the total biomass for taxa_name
within each site
, i.e. it is a one-to-many value in my new table.
To get away from 2 SQL
queries, I've done the work in two data.table
manipulations and a join at the end. An alternative is to do calculations and remove duplicates twice. Is there a way to avoid either of these just by using data.table
?
testdf <- structure(list(spcode = c(10008L, 10008L, 10002L, 10002L, 10006L, 10008L, 10008L, 10002L, 10002L, 10011L, 10002L, 10002L, 10006L, 10006L, 10006L), abundance = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 2L), biomass = c(0.2, 0.2, 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.5, 0.1, 0.1, 0.5, 0.5, 0.5), size_class = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 13L, 17L, 12L, 5L, 9L, 10L, 11L), site = c(907L, 907L, 907L, 907L, 907L, 914L, 914L, 914L, 914L, 914L, 910L, 910L, 910L, 910L, 910L), taxa_name = c("Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Parophrys vetulus", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Microstomus pacificus", "Microstomus pacificus"), lnXabun = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 26L, 17L, 12L, 5L, 9L, 40L, 22L)), row.names = c(NA, -15L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x00362498>)
# biomass
bm <- testdf
bm <- bm[, .(site = unique(site)),
by = list(spcode, taxa_name, biomass)][, totbm := sum(biomass), by = list(spcode)][!duplicated(spcode), c(1,5)]
> bm
spcode totbm
1: 10008 0.5
2: 10002 0.3
3: 10006 0.6
4: 10011 0.5
Abundance is done next, then the two are joined on spcode
.
# abundance
testdf <- testdf[, .(totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class)),
by = list(spcode, taxa_name)]
# join
testdf[bm, on = 'spcode', bm := i.totbm]
> testdf
spcode taxa_name totabn n minlngth maxlngth bm
1: 10008 Hippoglossina stomata 85 4 20 23 0.5
2: 10002 Symphurus atricaudus 83 7 5 16 0.3
3: 10006 Microstomus pacificus 85 8 9 14 0.6
4: 10011 Parophrys vetulus 17 1 17 17 0.5
The above output of testdf
is my desired output. My other attempt relies on two !duplicated
calls. In my head, I want to be able to use [, totbm := sum(biomass), by = list(unique(site), spcode)]
, within the abundance calculation, but that doesn't work.
testdf[, .(site = (site), biomass = biomass, totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class)), by = list(spcode, taxa_name)][, totbm := sum(biomass), by = list(unique(site), spcode)]
Error in `[.data.table`(testdf[, .(site = (site), biomass = biomass, totabn = sum(lnXabun), : The items in the 'by' or 'keyby' list are length (3,15). Each must be length 15; the same length as there are rows in x (after subsetting if i is provided).
Alternative method:
alt <- bm[, .(site = site, taxa_name = taxa_name, biomass = biomass, totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class)),
by = list(spcode)]
alt <- alt[!duplicated(alt, by = c("site", "spcode"))]
alt[, totbm := sum(biomass), by = list(spcode)]
alt[!duplicated(alt, by = "spcode"), c(1,3,5:9)]
Upvotes: 2
Views: 66
Reputation: 344
Like I mentioned in my comment, I'm not a fan of the table with the data redundancies, but here is one way to solve the problem. Basically, instead of using some sort of 'unique' function, give in index number by group of site/taxa_name so that you can set all but the first biomass value to 0. Then a sum by spcode/taxa_name should work fine. This of course assumes that a set of site/taxa_name values corresponds to exactly one biomass value.
testdf <- data.table(spcode = c(10008L, 10008L, 10002L, 10002L, 10006L, 10008L, 10008L, 10002L, 10002L, 10011L, 10002L, 10002L, 10006L, 10006L, 10006L),
abundance = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 2L),
biomass = c(0.2, 0.2, 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.5, 0.1, 0.1, 0.5, 0.5, 0.5),
size_class = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 13L, 17L, 12L, 5L, 9L, 10L, 11L),
site = c(907L, 907L, 907L, 907L, 907L, 914L, 914L, 914L, 914L, 914L, 910L, 910L, 910L, 910L, 910L),
taxa_name = c("Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Parophrys vetulus", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Microstomus pacificus", "Microstomus pacificus"),
lnXabun = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 26L, 17L, 12L, 5L, 9L, 40L, 22L))
testdf[, biomassIdx := 1:.N, by = c('site', 'taxa_name')]
testdf[biomassIdx > 1, biomass := 0]
testdf[, .(tatabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class) , bm = sum(biomass)),
by = list(spcode, taxa_name)]
Upvotes: 3
Reputation: 5049
Unless I'm missing something, you're complicating yourself a bit. Just do a distinct and a summary:
bm <- testdf[, .SD[1L], by = list(spcode, taxa_name, biomass, site) # distinct
][, .(totbm = sum(biomass)), by = "spcode"] # summary
Upvotes: 1