Vesuccio
Vesuccio

Reputation: 607

Counting rows in nested groups

This is an extension of a previous question that I have asked. Unfortunately, I have not successfully modified previous code to fit this new requirement.

I have a data set with a nested group design and need to sequentially count sub-groups. The example data frame below illustrates the structure of my data set. GROUP_LENGTH is the number of rows within the same GROUP and note that the rows within these groups are counted sequentially (i.e., a1 and c2 occur multiple times). GROUP_LENGTH sequentially numbers the rows within each GROUP_LENGTH.

My issues arise with SEQ_SAMPLE. My current code sequentially numbers the GROUP for the whole data set, but I need the code to sequentially number the GROUP within each SITE and SAMPLE (see Result dataframe).

I have tried variations of this code using ave, which has not worked. I also see how to use a for loop, but prefer to avoid this if possible as I have a large data set. I am also open to all solutions, but prefer base package solutions as I want to avoid having to load packages. Thanks in advance for your help.

df <- read.table(text = "SITE   SAMPLE  TIME    GROUP
a   1   1   a1
a   1   2   a2
a   1   3   a2
a   1   4   a3
a   2   1   b1
a   2   2   b1
a   2   3   b1
a   2   4   b3
a   2   5   b3
a   2   6   b3
b   1   1   c1
b   1   2   c2
b   1   3   c3
b   1   4   c4
b   1   5   c4
b   2   1   d1
b   2   2   d2
b   2   3   d3
", header = TRUE)

result <- read.table(text = "SITE   SAMPLE  TIME    GROUP   GROUP_LENGTH    GROUP_SEQ   SEQ_SAMPLE
a   1   1   a1  1   1   1
a   1   2   a2  2   1   2
a   1   3   a2  2   2   2
a   1   4   a3  1   1   3
a   2   1   b1  3   1   1
a   2   2   b1  3   2   1
a   2   3   b1  3   3   1
a   2   4   b3  3   1   2
a   2   5   b3  3   2   2
a   2   6   b3  3   3   2
b   1   1   c1  1   1   1
b   1   2   c2  1   1   2
b   1   3   c3  1   1   3
b   1   4   c4  2   1   4
b   1   5   c4  2   2   4
b   2   1   d1  1   1   1
b   2   2   d2  1   2   2
b   2   3   d3  1   3   3
", header = TRUE)

# First two lines of the code work fine....but it is the last line where the bug lines...
df$GROUP_LENGTH = with(rle(paste(df$SAMPLE, df$GROUP)), rep(lengths, lengths))
df$GROUP_SEQ = sequence(rle(paste(df$SAMPLE, df$GROUP))$lengths)

df$SEQ_SAMPLE = with(rle(paste(df$SITE, df$SAMPLE, df$GROUP)), rep(seq_along(values),lengths))

Upvotes: 1

Views: 323

Answers (2)

AnilGoyal
AnilGoyal

Reputation: 26236

A tidyverse solution, which is very easy as it is direct application of three basic functions in dplyr,

  • n() returning count of rows
  • row_number() which numbers the rows requentially
  • dense_rank() which provides each element an identity
library(dplyr)
df %>% group_by(GROUP) %>%
  mutate(GROUP_LENGTH = n(),
         GROUP_SEQ = row_number()) %>%
  group_by(SITE, SAMPLE) %>%
  mutate(SEQ_SAMPLE = dense_rank(GROUP)) %>%
  ungroup()

# A tibble: 18 x 7
   SITE  SAMPLE  TIME GROUP GROUP_LENGTH GROUP_SEQ SEQ_SAMPLE
   <chr>  <int> <int> <chr>        <int>     <int>      <int>
 1 a          1     1 a1               1         1          1
 2 a          1     2 a2               2         1          2
 3 a          1     3 a2               2         2          2
 4 a          1     4 a3               1         1          3
 5 a          2     1 b1               3         1          1
 6 a          2     2 b1               3         2          1
 7 a          2     3 b1               3         3          1
 8 a          2     4 b3               3         1          2
 9 a          2     5 b3               3         2          2
10 a          2     6 b3               3         3          2
11 b          1     1 c1               1         1          1
12 b          1     2 c2               1         1          2
13 b          1     3 c3               1         1          3
14 b          1     4 c4               2         1          4
15 b          1     5 c4               2         2          4
16 b          2     1 d1               1         1          1
17 b          2     2 d2               1         1          2
18 b          2     3 d3               1         1          3

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 389355

You may use ave and create unique number for GROUP within each SITE and SAMPLE.

df$SEQ_SAMPLE = with(df, as.integer(ave(GROUP, SITE, SAMPLE, 
               FUN = function(x) with(rle(x), rep(seq_along(values), lengths)))))


identical(df$SEQ_SAMPLE, result$SEQ_SAMPLE)
#[1] TRUE

Upvotes: 3

Related Questions