jstauss
jstauss

Reputation: 177

R - how to fill NA's between two corresponding ID's in a dataframe

I am trying to take the the following dataset and convert it to the second. Basically, I'm trying to fill in the NA's between each ID with that ID.

Each ID corresponds to two time stamps, which I've joined in to a larger date_time column. For purposes of reproduce-ability, it's too computationally expensive to do a sql between join (the date_time column is very large) or even to take the original dataset and create the time stamps between each id and then join it in (I have too many ID's to do this). I've successfully completed both those methods and it just takes too much time for the amount of data I have. I'm hoping to manipulate the data with this data set. It seems like such a simple thing, but it's really got me stumped. Any help would be appreciated.

Current data set:

             date_time     id
                <dttm>  <dbl>
 1 2017-01-30 08:00:00     NA
 2 2017-01-30 08:00:01     NA
 3 2017-01-30 08:00:02     1
 4 2017-01-30 08:00:03     NA
 5 2017-01-30 08:00:04     NA
 6 2017-01-30 08:00:05     NA
 7 2017-01-30 08:00:06     NA
 8 2017-01-30 08:00:07     1
 9 2017-01-30 08:00:08     NA
10 2017-01-30 08:00:09     NA
11 2017-01-30 08:00:10     2
12 2017-01-30 08:00:11     NA
13 2017-01-30 08:00:12     NA
14 2017-01-30 08:00:13     NA
15 2017-01-30 08:00:14     2
16 2017-01-30 08:00:15     NA
17 2017-01-30 08:00:16     3
18 2017-01-30 08:00:17     NA
19 2017-01-30 08:00:18     3
20 2017-01-30 08:00:19     NA

Desired dataset:

            date_time     id
                <dttm>  <dbl>
 1 2017-01-30 08:00:00     NA
 2 2017-01-30 08:00:01     NA
 3 2017-01-30 08:00:02     1
 4 2017-01-30 08:00:03     1
 5 2017-01-30 08:00:04     1
 6 2017-01-30 08:00:05     1
 7 2017-01-30 08:00:06     1
 8 2017-01-30 08:00:07     1
 9 2017-01-30 08:00:08     NA
10 2017-01-30 08:00:09     NA
11 2017-01-30 08:00:10     2
12 2017-01-30 08:00:11     2
13 2017-01-30 08:00:12     2
14 2017-01-30 08:00:13     2
15 2017-01-30 08:00:14     2
16 2017-01-30 08:00:15     NA
17 2017-01-30 08:00:16     3
18 2017-01-30 08:00:17     3
19 2017-01-30 08:00:18     3
20 2017-01-30 08:00:19     NA

dput() date:

structure(list(date_time = structure(c(1485781200, 1485781201, 
1485781202, 1485781203, 1485781204, 1485781205, 1485781206, 1485781207, 
1485781208, 1485781209, 1485781210, 1485781211, 1485781212, 1485781213, 
1485781214, 1485781215, 1485781216, 1485781217, 1485781218, 1485781219
), class = c("POSIXct", "POSIXt"), tzone = ""), trx_id = c(NA_real_, 
NA_real_, 1, NA_real_, NA_real_, NA_real_, NA_real_, 1, 
NA_real_, NA_real_, 2, NA_real_, NA_real_, NA_real_, 2, 
NA_real_, 3, NA_real_, 3, NA_real_)), .Names = c("date_time", 
"trx_id"), row.names = c(NA, -20L), class = c("tbl_df", "tbl", 
"data.frame"))

Upvotes: 5

Views: 1366

Answers (3)

James Thomas Durant
James Thomas Durant

Reputation: 305

A non tidyr approach, where x is your id column:

x <- c(NA,NA, 1,NA,NA,1, NA, NA, 2, NA, NA,2, NA, 3,NA, NA,3,NA)

timestamps <- paste(unique(x)[!is.na(unique(x))], collapse = "|")

timestamps <- grep(timestamps, x)
timestamps <- matrix(timestamps, ncol = 2, byrow=TRUE)

xmatrix <- apply(timestamps, MARGIN = 1, FUN = function(i) {
  y <- x[i[1]:i[2]]
  y[is.na(y)] <- x[i][1]
  x[i[1]:i[2]] <- y 
  return(x)
})

(x <- apply(xmatrix, 1,FUN = function(z) {

  ifelse(all(is.na(z)), NA, max(z, na.rm=TRUE))
  }))

##  [1] NA NA  1  1  1  1 NA NA  2  2  2  2 NA  3  3  3  3 NA

HTH

Upvotes: 4

akrun
akrun

Reputation: 887118

Here is a base R option. We split the sequence of rows of the dataset with 'trx_id' (the one OP showed as input data), get the sequence (seq), stack it to two column dataset and assign the 'trx_id' to 'ind' column of 'd1' based on the 'values' as index from 'd1'

d1 <- stack(lapply(split(seq_len(nrow(df1)), df1$trx_id), function(x) seq(x[1], x[2])))
df1$trx_id[d1$values] <- d1$ind
df1$trx_id
#[1] NA NA  1  1  1  1  1  1 NA NA  2  2  2  2  2 NA  3  3  3 NA

Upvotes: 4

MKR
MKR

Reputation: 20095

One solution could be using fill function from tidyr. The approach is simple. First create 2 columns one each for prev and next values. Use fill to populate missing values in both columns.

Now, for rows which has same value in both prev_val and next_val the value should be updated with prev_val (that means those missing values are between same number)

df <-  read.table(text = "sl date_time, value
1 '2017-01-30 08:00:00'     NA
2 '2017-01-30 08:00:01'     NA
3 '2017-01-30 08:00:02'     1
4 '2017-01-30 08:00:03'     NA
5 '2017-01-30 08:00:04'     NA
6 '2017-01-30 08:00:05'     NA
7 '2017-01-30 08:00:06'     NA
8 '2017-01-30 08:00:07'     1
9 '2017-01-30 08:00:08'     NA
10 '2017-01-30 08:00:09'     NA
11 '2017-01-30 08:00:10'     2
12 '2017-01-30 08:00:11'     NA
13 '2017-01-30 08:00:12'     NA
14 '2017-01-30 08:00:13'     NA
15 '2017-01-30 08:00:14'     2
16 '2017-01-30 08:00:15'     NA
17 '2017-01-30 08:00:16'     3
18 '2017-01-30 08:00:17'     NA
19 '2017-01-30 08:00:18'     3
20 '2017-01-30 08:00:19'     NA", header = T, stringsAsFactor = F)

#use fill to find missing values
df %>%
  mutate(prev_val = (value), next_val = (value)) %>%
  fill(prev_val, .direction = "down") %>%
  fill(next_val, .direction = "up") %>%
  mutate(value = ifelse(prev_val == next_val, prev_val, value )) %>%
  select(-prev_val, -next_val)

Result:
   sl          date_time. value
1   1 2017-01-30 08:00:00    NA
2   2 2017-01-30 08:00:01    NA
3   3 2017-01-30 08:00:02     1
4   4 2017-01-30 08:00:03     1
5   5 2017-01-30 08:00:04     1
6   6 2017-01-30 08:00:05     1
7   7 2017-01-30 08:00:06     1
8   8 2017-01-30 08:00:07     1
9   9 2017-01-30 08:00:08    NA
10 10 2017-01-30 08:00:09    NA
11 11 2017-01-30 08:00:10     2
12 12 2017-01-30 08:00:11     2
13 13 2017-01-30 08:00:12     2
14 14 2017-01-30 08:00:13     2
15 15 2017-01-30 08:00:14     2
16 16 2017-01-30 08:00:15    NA
17 17 2017-01-30 08:00:16     3
18 18 2017-01-30 08:00:17     3
19 19 2017-01-30 08:00:18     3
20 20 2017-01-30 08:00:19    NA

Upvotes: 10

Related Questions