user11418708
user11418708

Reputation: 902

How can I count the total number of occurrences at time step t of an element?

I am a beginner in R but I would like to write a piece code that requires some R and data science knowledge.

I have a data frame with the following structure; t1 denotes 10 minutes time periods and 1 defines measurements.

 t1 t2 t3 t4
  1  0  0  0
  1  1  1  1
  0  1  1  1
  0  1  1  1
  1  0  1  1

I would like to identify the duration and starting point for each measurement. For example, there are two 10 min measurements starting at t1 (row 1 and row 5) and there are two 30 minutes measurements that start at t2 (row 3and row 4).

Output:

 duration_minutes   t1 t2 t3 t4
      10            2   0  0  0
      20            1   0  1  0
      30            0   2  0  0
      40            1   0  0  0

Is there a way to convert the counts into percentages?

df<-structure(list(t1 = c(1, 1, 0, 0,1),
                   t2 = c(0, 1, 1, 1,0), t3 = c(0, 1, 1, 1,1), t4 = c(0, 1, 1,  1,1)), row.names = c(NA,5L), class = "data.frame") 

                                                                                  

Upvotes: 5

Views: 190

Answers (2)

nniloc
nniloc

Reputation: 4243

There is a function called rle which seems like it might be nice fit here. This answer is almost all base R, but I couldn't resist dipping into the tidyverse for map_dfr.

library(purrr)

df_rle <- map_dfr(row.names(df), function(x){
  r <- rle(rev(df[x,]))
  rev(r$lengths * r$values * 10)  
})

tab <- table(stack(df_rle), exclude = c('0', NA))
tab
#---------
      ind
values t1 t2 t3
    10  2  0  0
    20  0  0  1
    30  0  2  0
    40  1  0  0

If you wanted to represent these as percentages

sweep(tab,2, colSums(tab), '/')

#--------
      ind
values        t1        t2        t3
    10 0.6666667 0.0000000 0.0000000
    20 0.0000000 0.0000000 1.0000000
    30 0.0000000 1.0000000 0.0000000
    40 0.3333333 0.0000000 0.0000000


Break it down a bit

The rle function when run on a given row gets us pretty close to the desired output. As an example row 5 from df.

r <- rle(c(1,0,1,1))
r
#-----
Run Length Encoding
  lengths: int [1:3] 1 1 2
  values : num [1:3] 1 0 1


# Multiply to get time periods
r$lengths * r$values * 10
# -----
[1] 10  0 20

But with a dataframe row as input, values are linked to the last time period not the first. So we reverse the order in which we feed the row to the rle function, and then un-reverse (?) the results.

r <- rle(df[5,])
r$values
#------
  t1 t2 t4
5  1  0  1

# Reverse before we feed into rle(), reverse the output
r <- rle(rev(df[5,]))
rev(r$lengths * r$values * 10) 
#-----------
  t1 t2 t3
5 10  0 20

Then we need to do this reversed rle function on each row. This answer uses purrr::map_dfr(), which maps the function to each row, then row binds the results together into a single dataframe.

df_rle <- map_dfr(row.names(df), function(x){
  r <- rle(rev(df[x,]))
  rev(r$lengths * r$values * 10)  
}) 


#-----
  t1 t2 t3
1 10  0 NA
2 40 NA NA
3  0 30 NA
4  0 30 NA
5 10  0 20

From here, we need to count the values by the 10 minute duration categories. There are several ways to do this. Here is one way in which we first convert df_rle to a long form two column dataframe using stack, then use the table function to tabulate by duration levels.

tab <- table(stack(df_rle), exclude = c('0', NA))

#--------
      ind
values t1 t2 t3
    10  2  0  0
    20  0  0  1
    30  0  2  0
    40  1  0  0

To convert to percentages (assuming you mean duration as a percent of each time period) you can divide by the original number of measurements, which here is equal to colSums. sweep can apply the value rowwise (1) or column-wise (2), using the function divide '/'.

sweep(tab, 2, colSums(tab), '/')

Upvotes: 2

pieterbons
pieterbons

Reputation: 1724

I have solved it by first pivoting the data into the long format and then counting consecutive 1's to determine the duration of each measurement. Then I count how often each duration occurs per starting time and pivot back to the wider format to get the output that you described (I do not find a column t4 since it doesn't have any new measurements):

library(tidyr)
library(dplyr)

df %>% 
  mutate(rownr = 1:nrow(.)) %>% 
  pivot_longer(names_to = 'time', values_to = 'value', cols = 1:4) %>% 
  group_by(rownr, grp = cumsum(value == 0)) %>% 
  mutate(duration = 10 * cumsum(value)) %>% 
  filter(duration != 0) %>% 
  summarise(time = first(time), 
            duration = max(duration)) %>% 
  group_by(time, duration) %>% 
  count() %>% 
  pivot_wider(names_from = time, values_from = n, values_fill = 0) %>% 
  arrange(duration)

Upvotes: 3

Related Questions