Reputation: 177
I am trying to take the the following dataset and convert it to the second. Basically, I'm trying to fill in the NA's between each ID with that ID.
Each ID corresponds to two time stamps, which I've joined in to a larger date_time column. For purposes of reproduce-ability, it's too computationally expensive to do a sql between join (the date_time column is very large) or even to take the original dataset and create the time stamps between each id and then join it in (I have too many ID's to do this). I've successfully completed both those methods and it just takes too much time for the amount of data I have. I'm hoping to manipulate the data with this data set. It seems like such a simple thing, but it's really got me stumped. Any help would be appreciated.
Current data set:
date_time id
<dttm> <dbl>
1 2017-01-30 08:00:00 NA
2 2017-01-30 08:00:01 NA
3 2017-01-30 08:00:02 1
4 2017-01-30 08:00:03 NA
5 2017-01-30 08:00:04 NA
6 2017-01-30 08:00:05 NA
7 2017-01-30 08:00:06 NA
8 2017-01-30 08:00:07 1
9 2017-01-30 08:00:08 NA
10 2017-01-30 08:00:09 NA
11 2017-01-30 08:00:10 2
12 2017-01-30 08:00:11 NA
13 2017-01-30 08:00:12 NA
14 2017-01-30 08:00:13 NA
15 2017-01-30 08:00:14 2
16 2017-01-30 08:00:15 NA
17 2017-01-30 08:00:16 3
18 2017-01-30 08:00:17 NA
19 2017-01-30 08:00:18 3
20 2017-01-30 08:00:19 NA
Desired dataset:
date_time id
<dttm> <dbl>
1 2017-01-30 08:00:00 NA
2 2017-01-30 08:00:01 NA
3 2017-01-30 08:00:02 1
4 2017-01-30 08:00:03 1
5 2017-01-30 08:00:04 1
6 2017-01-30 08:00:05 1
7 2017-01-30 08:00:06 1
8 2017-01-30 08:00:07 1
9 2017-01-30 08:00:08 NA
10 2017-01-30 08:00:09 NA
11 2017-01-30 08:00:10 2
12 2017-01-30 08:00:11 2
13 2017-01-30 08:00:12 2
14 2017-01-30 08:00:13 2
15 2017-01-30 08:00:14 2
16 2017-01-30 08:00:15 NA
17 2017-01-30 08:00:16 3
18 2017-01-30 08:00:17 3
19 2017-01-30 08:00:18 3
20 2017-01-30 08:00:19 NA
dput() date:
structure(list(date_time = structure(c(1485781200, 1485781201,
1485781202, 1485781203, 1485781204, 1485781205, 1485781206, 1485781207,
1485781208, 1485781209, 1485781210, 1485781211, 1485781212, 1485781213,
1485781214, 1485781215, 1485781216, 1485781217, 1485781218, 1485781219
), class = c("POSIXct", "POSIXt"), tzone = ""), trx_id = c(NA_real_,
NA_real_, 1, NA_real_, NA_real_, NA_real_, NA_real_, 1,
NA_real_, NA_real_, 2, NA_real_, NA_real_, NA_real_, 2,
NA_real_, 3, NA_real_, 3, NA_real_)), .Names = c("date_time",
"trx_id"), row.names = c(NA, -20L), class = c("tbl_df", "tbl",
"data.frame"))
Upvotes: 5
Views: 1366
Reputation: 305
A non tidyr approach, where x is your id column:
x <- c(NA,NA, 1,NA,NA,1, NA, NA, 2, NA, NA,2, NA, 3,NA, NA,3,NA)
timestamps <- paste(unique(x)[!is.na(unique(x))], collapse = "|")
timestamps <- grep(timestamps, x)
timestamps <- matrix(timestamps, ncol = 2, byrow=TRUE)
xmatrix <- apply(timestamps, MARGIN = 1, FUN = function(i) {
y <- x[i[1]:i[2]]
y[is.na(y)] <- x[i][1]
x[i[1]:i[2]] <- y
return(x)
})
(x <- apply(xmatrix, 1,FUN = function(z) {
ifelse(all(is.na(z)), NA, max(z, na.rm=TRUE))
}))
## [1] NA NA 1 1 1 1 NA NA 2 2 2 2 NA 3 3 3 3 NA
HTH
Upvotes: 4
Reputation: 887118
Here is a base R
option. We split
the sequence of rows of the dataset with 'trx_id' (the one OP showed as input data), get the sequence (seq
), stack
it to two column dataset and assign the 'trx_id' to 'ind' column of 'd1' based on the 'values' as index from 'd1'
d1 <- stack(lapply(split(seq_len(nrow(df1)), df1$trx_id), function(x) seq(x[1], x[2])))
df1$trx_id[d1$values] <- d1$ind
df1$trx_id
#[1] NA NA 1 1 1 1 1 1 NA NA 2 2 2 2 2 NA 3 3 3 NA
Upvotes: 4
Reputation: 20095
One solution could be using fill
function from tidyr
. The approach is simple. First create 2 columns one each for prev
and next
values. Use fill
to populate missing values in both columns.
Now, for rows which has same value in both prev_val
and next_val
the value should be updated with prev_val
(that means those missing values are between same number)
df <- read.table(text = "sl date_time, value
1 '2017-01-30 08:00:00' NA
2 '2017-01-30 08:00:01' NA
3 '2017-01-30 08:00:02' 1
4 '2017-01-30 08:00:03' NA
5 '2017-01-30 08:00:04' NA
6 '2017-01-30 08:00:05' NA
7 '2017-01-30 08:00:06' NA
8 '2017-01-30 08:00:07' 1
9 '2017-01-30 08:00:08' NA
10 '2017-01-30 08:00:09' NA
11 '2017-01-30 08:00:10' 2
12 '2017-01-30 08:00:11' NA
13 '2017-01-30 08:00:12' NA
14 '2017-01-30 08:00:13' NA
15 '2017-01-30 08:00:14' 2
16 '2017-01-30 08:00:15' NA
17 '2017-01-30 08:00:16' 3
18 '2017-01-30 08:00:17' NA
19 '2017-01-30 08:00:18' 3
20 '2017-01-30 08:00:19' NA", header = T, stringsAsFactor = F)
#use fill to find missing values
df %>%
mutate(prev_val = (value), next_val = (value)) %>%
fill(prev_val, .direction = "down") %>%
fill(next_val, .direction = "up") %>%
mutate(value = ifelse(prev_val == next_val, prev_val, value )) %>%
select(-prev_val, -next_val)
Result:
sl date_time. value
1 1 2017-01-30 08:00:00 NA
2 2 2017-01-30 08:00:01 NA
3 3 2017-01-30 08:00:02 1
4 4 2017-01-30 08:00:03 1
5 5 2017-01-30 08:00:04 1
6 6 2017-01-30 08:00:05 1
7 7 2017-01-30 08:00:06 1
8 8 2017-01-30 08:00:07 1
9 9 2017-01-30 08:00:08 NA
10 10 2017-01-30 08:00:09 NA
11 11 2017-01-30 08:00:10 2
12 12 2017-01-30 08:00:11 2
13 13 2017-01-30 08:00:12 2
14 14 2017-01-30 08:00:13 2
15 15 2017-01-30 08:00:14 2
16 16 2017-01-30 08:00:15 NA
17 17 2017-01-30 08:00:16 3
18 18 2017-01-30 08:00:17 3
19 19 2017-01-30 08:00:18 3
20 20 2017-01-30 08:00:19 NA
Upvotes: 10