kukuk1de
kukuk1de

Reputation: 406

Count rows between NA's

i'm trying to get a specific count from a previously created result set. I need a count of rows between rows that contain NA's. An aggregation of values of these rows are not of interest, only the count.

Below a quite simplified example which hopefully better explains what i'm talking about. At left hand the actual data and at right hand the desired result.

+------+-------+---+------+--------+
| TIME | Value | - | TIME | Result |
+------+-------+---+------+--------+
|   10 | NA    | - |   20 |      2 |
|   20 | 0     | - |   60 |      3 |
|   30 | 1     | - |      |        |
|   40 | NA    | - |      |        |
|   50 | NA    | - |      |        |
|   60 | 30    | - |      |        |
|   70 | 68    | - |      |        |
|   80 | 0     | - |      |        |
|   90 | NA    | - |      |        |
+------+-------+---+------+--------+

Any comments are welcome. In case additional input is needed please leave a message.

Upvotes: 5

Views: 900

Answers (4)

talat
talat

Reputation: 70316

This is perhaps not the most straightforward way to do it, but it does produce the desired result and since I wrote it, I thought I might post it, too (using the sample data by @konvas):

require(dplyr)

dat %>%
  group_by(m = cumsum(is.na(value))) %>% 
  summarise(n = n() -1, time = first(time[!is.na(value)])) %>%
  ungroup() %>%
  filter(n > 0 & m > 0) %>% 
  select(-m)
#Source: local data frame [2 x 2]
#
#  n time
#1 2   20
#2 3   60

Edit: I made a small correction in response to Ananda's comment, hope it works better now. For example, if the data was:

dat <- data.frame(time = seq(10, 90, 10), value = c(0, 2, 1, NA, NA, 30, 68, 0, NA))
dat
#  time value
#1   10     0
#2   20     2
#3   30     1
#4   40    NA
#5   50    NA
#6   60    30
#7   70    68
#8   80     0
#9   90    NA

The code would result in:

dat %>%
  group_by(m = cumsum(is.na(value))) %>% 
  summarise(n = n() -1, time = first(time[!is.na(value)])) %>%
  ungroup() %>%
  filter(n > 0 & m > 0) %>% 
  select(-m)
#Source: local data frame [1 x 2]
#
#  n time
#1 3   60

Upvotes: 3

konvas
konvas

Reputation: 14366

Apart from rle, you can also use a combination of diff, which and is.na:

dat <- data.frame(time = seq(10, 90, 10), value = c(NA, 0, 1, NA, NA, 30, 68, 0, NA))
res <- data.frame(result = diff(which(is.na(dat$value))) - 1)
res$time <- dat$time[which(is.na(dat$value)) + 1][1:nrow(res)]
res[res$result != 0, ]
# time result
#   20      2
#   60      3

Upvotes: 6

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193657

My "SOfun" package has a function called TrueSeq that is like a group-maker with logical vectors. You can use that function in conjunction with "data.table" to get the result you want, like this:

library(SOfun)
library(data.table)
na.omit(data.table(TIME = df$TIME, Val = TrueSeq(
  !is.na(df$value), zero2NA = TRUE)))[, list(TIME = TIME[1], .N), by = Val]
#    Val TIME N
# 1:   1   20 2
# 2:   2   60 3

If you have "devtools" installed, you can install "SOfun" with:

library(devtools)
install_github("mrdwab/SOfun")

For reference, I've posted this Gist to be able to compare the results from the different approaches at this answer.

To summarize:

  • If the first value in the "values" column is NA:
    • All approaches would yield the same answer.
  • If the first value in the "values" column is not NA:
    • This answer and @RStudent's would be the same, starting with the first non-NA value (thus, the first row of the input) as the first row of the results.
    • @konvas's answer and @beginneR's would be the same, starting with the second non-NA value as the first row of the results.

Upvotes: 5

DatamineR
DatamineR

Reputation: 9618

In order to make my answer complete here a modified version:

d <- data.frame( TIME = seq(10, 90, by = 10), Value = c(NA, 0, 1, NA, NA, 30, 68, 0, NA))


aux <- rle(as.numeric((!is.na(d[,2]))))

cbind(TIME = d[cumsum(aux$lengths)[which(aux$values == 1)] - aux$lengths[aux$values == 1] +1, 1],
Result = rle(is.na(d$Value))$lengths[!rle(is.na(d$Value))$values])
     TIME Result
[1,]    2     20
[2,]    3     60

Upvotes: 6

Related Questions