Reputation: 902
I am a beginner in R but I would like to write a piece code that requires some R and data science knowledge.
I have a data frame with the following structure; t1 denotes 10 minutes time periods and 1 defines measurements.
t1 t2 t3 t4
1 0 0 0
1 1 1 1
0 1 1 1
0 1 1 1
1 0 1 1
I would like to identify the duration and starting point for each measurement. For example, there are two 10 min measurements starting at t1 (row 1 and row 5) and there are two 30 minutes measurements that start at t2 (row 3and row 4).
Output:
duration_minutes t1 t2 t3 t4
10 2 0 0 0
20 1 0 1 0
30 0 2 0 0
40 1 0 0 0
Is there a way to convert the counts into percentages?
df<-structure(list(t1 = c(1, 1, 0, 0,1),
t2 = c(0, 1, 1, 1,0), t3 = c(0, 1, 1, 1,1), t4 = c(0, 1, 1, 1,1)), row.names = c(NA,5L), class = "data.frame")
Upvotes: 5
Views: 190
Reputation: 4243
There is a function called rle
which seems like it might be nice fit here. This answer is almost all base
R, but I couldn't resist dipping into the tidyverse
for map_dfr
.
library(purrr)
df_rle <- map_dfr(row.names(df), function(x){
r <- rle(rev(df[x,]))
rev(r$lengths * r$values * 10)
})
tab <- table(stack(df_rle), exclude = c('0', NA))
tab
#---------
ind
values t1 t2 t3
10 2 0 0
20 0 0 1
30 0 2 0
40 1 0 0
If you wanted to represent these as percentages
sweep(tab,2, colSums(tab), '/')
#--------
ind
values t1 t2 t3
10 0.6666667 0.0000000 0.0000000
20 0.0000000 0.0000000 1.0000000
30 0.0000000 1.0000000 0.0000000
40 0.3333333 0.0000000 0.0000000
Break it down a bit
The rle
function when run on a given row gets us pretty close to the desired output. As an example row 5 from df
.
r <- rle(c(1,0,1,1))
r
#-----
Run Length Encoding
lengths: int [1:3] 1 1 2
values : num [1:3] 1 0 1
# Multiply to get time periods
r$lengths * r$values * 10
# -----
[1] 10 0 20
But with a dataframe row as input, values are linked to the last time period not the first. So we reverse the order in which we feed the row to the rle
function, and then un-reverse (?) the results.
r <- rle(df[5,])
r$values
#------
t1 t2 t4
5 1 0 1
# Reverse before we feed into rle(), reverse the output
r <- rle(rev(df[5,]))
rev(r$lengths * r$values * 10)
#-----------
t1 t2 t3
5 10 0 20
Then we need to do this reversed rle function on each row. This answer uses purrr::map_dfr()
, which maps the function to each row, then row binds the results together into a single dataframe.
df_rle <- map_dfr(row.names(df), function(x){
r <- rle(rev(df[x,]))
rev(r$lengths * r$values * 10)
})
#-----
t1 t2 t3
1 10 0 NA
2 40 NA NA
3 0 30 NA
4 0 30 NA
5 10 0 20
From here, we need to count the values by the 10 minute duration categories. There are several ways to do this. Here is one way in which we first convert df_rle
to a long form two column dataframe using stack
, then use the table
function to tabulate by duration levels.
tab <- table(stack(df_rle), exclude = c('0', NA))
#--------
ind
values t1 t2 t3
10 2 0 0
20 0 0 1
30 0 2 0
40 1 0 0
To convert to percentages (assuming you mean duration as a percent of each time period) you can divide by the original number of measurements, which here is equal to colSums
. sweep
can apply the value rowwise (1) or column-wise (2), using the function divide '/'
.
sweep(tab, 2, colSums(tab), '/')
Upvotes: 2
Reputation: 1724
I have solved it by first pivoting the data into the long format and then counting consecutive 1's to determine the duration of each measurement. Then I count how often each duration occurs per starting time and pivot back to the wider format to get the output that you described (I do not find a column t4 since it doesn't have any new measurements):
library(tidyr)
library(dplyr)
df %>%
mutate(rownr = 1:nrow(.)) %>%
pivot_longer(names_to = 'time', values_to = 'value', cols = 1:4) %>%
group_by(rownr, grp = cumsum(value == 0)) %>%
mutate(duration = 10 * cumsum(value)) %>%
filter(duration != 0) %>%
summarise(time = first(time),
duration = max(duration)) %>%
group_by(time, duration) %>%
count() %>%
pivot_wider(names_from = time, values_from = n, values_fill = 0) %>%
arrange(duration)
Upvotes: 3