Anonymous coward
Anonymous coward

Reputation: 2091

Aggregate data and exclude duplicates in one column

I am trying to simplify an analysis that used two SQL queries, down to one. In doing so, I joined biomass data to the size class data within the single SQL query, which creates duplicates. This is because biomass is already a sum, and is the total biomass for taxa_name within each site, i.e. it is a one-to-many value in my new table.

To get away from 2 SQL queries, I've done the work in two data.table manipulations and a join at the end. An alternative is to do calculations and remove duplicates twice. Is there a way to avoid either of these just by using data.table?

Example data

testdf <- structure(list(spcode = c(10008L, 10008L, 10002L, 10002L, 10006L, 10008L, 10008L, 10002L, 10002L, 10011L, 10002L, 10002L, 10006L, 10006L, 10006L), abundance = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 2L), biomass = c(0.2, 0.2, 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.5, 0.1, 0.1, 0.5, 0.5, 0.5), size_class = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 13L, 17L, 12L, 5L, 9L, 10L, 11L), site = c(907L, 907L, 907L, 907L, 907L, 914L, 914L, 914L, 914L, 914L, 910L, 910L, 910L, 910L, 910L), taxa_name = c("Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Parophrys vetulus", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Microstomus pacificus", "Microstomus pacificus"), lnXabun = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 26L, 17L, 12L, 5L, 9L, 40L, 22L)), row.names = c(NA, -15L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x00362498>)

Calculations

# biomass
bm <- testdf
bm <- bm[, .(site = unique(site)),
   by = list(spcode, taxa_name, biomass)][, totbm := sum(biomass), by = list(spcode)][!duplicated(spcode), c(1,5)]

    > bm
   spcode totbm
1:  10008   0.5
2:  10002   0.3
3:  10006   0.6
4:  10011   0.5

Abundance is done next, then the two are joined on spcode.

# abundance
testdf <- testdf[, .(totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class)),
      by = list(spcode, taxa_name)]

# join
testdf[bm, on = 'spcode', bm := i.totbm]

> testdf
   spcode             taxa_name totabn n minlngth maxlngth  bm
1:  10008 Hippoglossina stomata     85 4       20       23 0.5
2:  10002  Symphurus atricaudus     83 7        5       16 0.3
3:  10006 Microstomus pacificus     85 8        9       14 0.6
4:  10011     Parophrys vetulus     17 1       17       17 0.5

The above output of testdf is my desired output. My other attempt relies on two !duplicated calls. In my head, I want to be able to use [, totbm := sum(biomass), by = list(unique(site), spcode)], within the abundance calculation, but that doesn't work.

testdf[, .(site = (site), biomass = biomass, totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class)), by = list(spcode, taxa_name)][, totbm := sum(biomass), by = list(unique(site), spcode)]
Error in `[.data.table`(testdf[, .(site = (site), biomass = biomass, totabn = sum(lnXabun),  : The items in the 'by' or 'keyby' list are length (3,15). Each must be length 15; the same length as there are rows in x (after subsetting if i is provided).

Alternative method:

alt <- bm[, .(site = site, taxa_name = taxa_name, biomass = biomass, totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class)),
by = list(spcode)]
alt <- alt[!duplicated(alt, by = c("site", "spcode"))]
alt[, totbm := sum(biomass), by = list(spcode)]
alt[!duplicated(alt, by = "spcode"), c(1,3,5:9)]

Upvotes: 2

Views: 66

Answers (2)

Andreas
Andreas

Reputation: 344

Like I mentioned in my comment, I'm not a fan of the table with the data redundancies, but here is one way to solve the problem. Basically, instead of using some sort of 'unique' function, give in index number by group of site/taxa_name so that you can set all but the first biomass value to 0. Then a sum by spcode/taxa_name should work fine. This of course assumes that a set of site/taxa_name values corresponds to exactly one biomass value.

testdf <- data.table(spcode = c(10008L, 10008L, 10002L, 10002L, 10006L, 10008L, 10008L, 10002L, 10002L, 10011L, 10002L, 10002L, 10006L, 10006L, 10006L), 
                         abundance = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 2L), 
                         biomass = c(0.2, 0.2, 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.5, 0.1, 0.1, 0.5, 0.5, 0.5), 
                         size_class = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 13L, 17L, 12L, 5L, 9L, 10L, 11L), 
                         site = c(907L, 907L, 907L, 907L, 907L, 914L, 914L, 914L, 914L, 914L, 910L, 910L, 910L, 910L, 910L), 
                         taxa_name = c("Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Parophrys vetulus", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Microstomus pacificus", "Microstomus pacificus"), 
                         lnXabun = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 26L, 17L, 12L, 5L, 9L, 40L, 22L))

testdf[, biomassIdx := 1:.N, by = c('site', 'taxa_name')]
testdf[biomassIdx > 1, biomass := 0]
testdf[, .(tatabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class) , bm = sum(biomass)),
        by = list(spcode, taxa_name)]

Upvotes: 3

Alexis
Alexis

Reputation: 5049

Unless I'm missing something, you're complicating yourself a bit. Just do a distinct and a summary:

bm <- testdf[, .SD[1L], by = list(spcode, taxa_name, biomass, site) # distinct
             ][, .(totbm = sum(biomass)), by = "spcode"] # summary

Upvotes: 1

Related Questions