ktyagi

Reputation: 1086

R: Calculating cumulative number of unique entries

I have a data frame from several experiments. I am looking to calculate the cumulative number of unique values obtained after each successive experiment.

For example, consider:

test <- data.frame(exp = c( rep("exp1" , 4) , rep("exp2" , 4), rep("exp3" , 4) , rep("exp4" , 5) ) , 
                   entries = c("abcd","efgh","ijkl","mnop", "qrst" , "uvwx" , "abcd","efgh","ijkl" , "qrst" , "uvwx", 
                               "yzab" , "yzab" , "cdef" , "mnop" , "uvwx" , "ghij"))

> test
    exp entries
1  exp1    abcd
2  exp1    efgh
3  exp1    ijkl
4  exp1    mnop
5  exp2    qrst
6  exp2    uvwx
7  exp2    abcd
8  exp2    efgh
9  exp3    ijkl
10 exp3    qrst
11 exp3    uvwx
12 exp3    yzab
13 exp4    yzab
14 exp4    cdef
15 exp4    mnop
16 exp4    uvwx
17 exp4    ghij

The total number of unique entries is nine. Now I want the result to look like:

   exp cum_unique_entries
1  exp1    4
2  exp2    6
3  exp3    7
4  exp4    9

Finally, I want to plot this as a barplot. I can do it with a for-loop approach, but I feel there has to be a more elegant way.
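For reference, a minimal sketch of such a for-loop (the names seen and cum_unique are illustrative, not from the original post):

exps <- unique(as.character(test$exp))
seen <- character(0)
cum_unique <- integer(length(exps))
for (i in seq_along(exps)) {
  # add this experiment's entries to the running set of values seen so far
  seen <- union(seen, as.character(test$entries[test$exp == exps[i]]))
  cum_unique[i] <- length(seen)
}
data.frame(exp = exps, cum_unique_entries = cum_unique)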

Upvotes: 5

Views: 2578

Answers (4)

G. Grothendieck

Reputation: 269852

1) sqldf This can be done in a single complex join:

library(sqldf)

sqldf("select b.exp, count(Distinct a.entries) cum 
       from test a join test b on a.exp <= b.exp group by b.exp")

giving:

   exp cum
1 exp1   4
2 exp2   6
3 exp3   7
4 exp4   9

This idea could be translated into corresponding code in the other frameworks as well, although it would be rather inefficient in base R and dplyr: one would have to build an n x n cartesian join and then subset it, whereas SQL may optimize the query to avoid materializing the large intermediate result.
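As a rough base-R sketch of that translation (my own illustration, not part of the answer):

# n x n cartesian self join, then keep rows where the left exp precedes the right
cj <- merge(test, test, by = NULL, suffixes = c(".a", ".b"))
cj <- cj[as.character(cj$exp.a) <= as.character(cj$exp.b), ]
# count the distinct entries accumulated up to each experiment
aggregate(entries.a ~ exp.b, cj, function(x) length(unique(x)))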

2) rollapplyr Sort test by exp and use rollapplyr with the vector of widths 1:n, i.e. expanding windows, to calculate the cumulative number of unique entries. Then use !duplicated(..., fromLast = TRUE) to take the last row of each exp group:

library(zoo)

n <- nrow(test)    
test_sorted <- test[order(test$exp), ]
len_uniq <- function(x) length(unique(x))
test_cum <- transform(test_sorted,  cum = rollapplyr(entries, 1:n, len_uniq, fill = NA))
test_cum[!duplicated(test_cum$exp, fromLast = TRUE), -2]

giving:

    exp cum
4  exp1   4
8  exp2   6
12 exp3   7
17 exp4   9

Upvotes: 1

Bruno Zamengo

Reputation: 860

I would use the cast function from the reshape package (see the package documentation for a simple example and the full reference). It should be as simple as:

reshape::cast(test, exp~., value="entries", function(x) length(unique(x)) )

What I'm doing is telling the function to take your dataset, use the exp column as the id variable, and "ignore" all the other variables (that's what the . means), adding the "metric" returned by the given function (function(x) length(unique(x))) computed on the entries column.

It's something like SQL's aggregate functions and GROUP BY construct. Its SQL equivalent is:

SELECT exp, count(distinct entries)
FROM test
GROUP BY exp

Upvotes: 0

acylam

Reputation: 18681

Here's another solution with dplyr:

library(dplyr)

test %>%
  mutate(cum_unique_entries = cumsum(!duplicated(entries))) %>%
  group_by(exp) %>%
  slice(n()) %>%
  select(-entries)

or

test %>%
  mutate(cum_unique_entries = cumsum(!duplicated(entries))) %>%
  group_by(exp) %>%
  summarise(cum_unique_entries = last(cum_unique_entries))

Result:

# A tibble: 4 x 2
     exp cum_unique_entries
  <fctr>              <int>
1   exp1                  4
2   exp2                  6
3   exp3                  7
4   exp4                  9

Note:

First find the cumulative sum of all non-duplicates (cumsum(!duplicated(entries))), then group_by exp and take the last cumsum of each group; that number is the cumulative count of unique entries up to and including each group.
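Since the question also asks for a barplot, here is a minimal base-R sketch (the name res is mine, introduced for illustration):

res <- test %>%
  mutate(cum_unique_entries = cumsum(!duplicated(entries))) %>%
  group_by(exp) %>%
  summarise(cum_unique_entries = last(cum_unique_entries))

# one bar per experiment, bar height = cumulative number of unique entries
barplot(res$cum_unique_entries, names.arg = as.character(res$exp),
        xlab = "experiment", ylab = "cumulative unique entries")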

Upvotes: 10

dww

Reputation: 31452

With library(data.table) we can do

setDT(test)[, new := cumsum(!duplicated(entries))]  # running count of first appearances
test[, .(cum_unique_entries = new[.N]), by = exp]   # last running count within each exp
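The two steps can also be chained into a single expression (a stylistic variant of the same computation, my rewrite rather than the answerer's):

setDT(test)[, new := cumsum(!duplicated(entries))][, .(cum_unique_entries = new[.N]), by = exp]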

Upvotes: 4
