Reputation: 1247
I have the following data frame df and I want to add a column that gives, for each row, the distance to the closest non-NA value.
df <- data.frame(x = 1:20)
df[c(1, 3, 4, 5, 11, 14, 15, 16), "x"] <- NA
In other words, I am looking for the following values:
df$distance <- c(1, 0, 1, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 1, 0, 0, 0, 0)
How can I do this automatically?
Upvotes: 3
Views: 1486
Reputation: 425
One method is to use distance() from the raster package, after using the package's raster() function to convert your matrix into a RasterLayer object.
The package is meant for maps, so when you use raster(), your object will have units, resolution, etc. Thus when you use distance(), the distance may be very large for an element that is one cell away from a non-NA (15796.35 for me). Just divide by this amount (and maybe round() to absorb rounding errors) to get your answer.
As an example, if I have an array object with NAs called a1:
> library(raster)
> a1 = array(
c(
c(1, 5, 6, NA, 1, 2, 5),
c(3, 4, NA, NA, NA, 8, 1),
c(5, 1, 7, NA, 2, 3, 7),
c(8, 1, 1, 2, 3, 6, 2)
),
c(7, 4)
)
> r1 = raster(a1)
> d1 = distance(r1)
> as.matrix(d1)
[,1] [,2] [,3] [,4]
[1,] 0.00 0.00 0.00 0
[2,] 0.00 0.00 0.00 0
[3,] 0.00 15796.35 0.00 0
[4,] 15796.33 31592.66 15796.33 0
[5,] 0.00 15796.33 0.00 0
[6,] 0.00 0.00 0.00 0
[7,] 0.00 0.00 0.00 0
> round(
as.matrix(d1) / 15796.35,
0
)
[,1] [,2] [,3] [,4]
[1,] 0 0 0 0
[2,] 0 0 0 0
[3,] 0 1 0 0
[4,] 1 2 1 0
[5,] 0 1 0 0
[6,] 0 0 0 0
[7,] 0 0 0 0
Which is your answer. I don't know how efficient the code behind the distance() function is, though, so I don't know if it will take a while or not.
EDIT: tested on an array object with 29000 NAs and it takes a long time. I advise you just use this for objects with few NAs.
Upvotes: 0
Reputation: 67778
You can use findInterval. First, find the indices of the NA and non-NA values, and initialize a distance column:
na <- which(is.na(df$x))
non_na <- which(!is.na(df$x))
df$distance2 <- 0
Then, use findInterval with the midpoints of the non-NA indices as breaks to find which interval each NA index falls in. Use the intervals to extract the corresponding non-NA indices, calculate the absolute difference to the NA indices, and assign these at the NA indices:
df$distance2[na] <- abs(na - non_na[findInterval(na, (non_na[-length(non_na)] + non_na[-1]) / 2) + 1])
df
# x distance distance2
# 1 NA 1 1
# 2 2 0 0
# 3 NA 1 1
# 4 NA 2 2
# 5 NA 1 1
# 6 6 0 0
# 7 7 0 0
# 8 8 0 0
# 9 9 0 0
# 10 10 0 0
# 11 NA 1 1
# 12 12 0 0
# 13 13 0 0
# 14 NA 1 1
# 15 NA 2 2
# 16 NA 1 1
# 17 17 0 0
# 18 18 0 0
# 19 19 0 0
# 20 20 0 0
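The two steps above can be wrapped in a small function. This is a minimal sketch: the name nearest_non_na_dist is hypothetical, and returning NA when the vector has no non-NA value at all is an assumption about the desired behaviour, not something the question specifies.

```r
# Hypothetical helper: distance (in positions) from each element of x
# to the nearest non-NA element, using the findInterval approach.
nearest_non_na_dist <- function(x) {
  na <- which(is.na(x))
  non_na <- which(!is.na(x))
  out <- numeric(length(x))
  if (length(na) == 0L) return(out)   # nothing missing: all distances are 0
  if (length(non_na) == 0L) {         # assumption: no reference point, so NA
    out[] <- NA_real_
    return(out)
  }
  # Midpoints between consecutive non-NA indices serve as interval breaks
  breaks <- (non_na[-length(non_na)] + non_na[-1]) / 2
  # Interval k (counting from 0) maps to the (k + 1)-th non-NA index
  nearest <- non_na[findInterval(na, breaks) + 1]
  out[na] <- abs(na - nearest)
  out
}

df <- data.frame(x = 1:20)
df[c(1, 3, 4, 5, 11, 14, 15, 16), "x"] <- NA
df$distance2 <- nearest_non_na_dist(df$x)
```

An NA index that lies exactly on a midpoint is assigned to the non-NA value on its right, which gives the same distance as the one on its left.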
Upvotes: 1
Reputation: 974
Here is another approach using rle and rank:
library(dplyr)
library(magrittr)
df <- data.frame(x=seq(1, 20))
df[c("1", "3", "4", "5", "11", "14", "15", "16"), 1] <- NA
rle.len <- df$x %>% is.na %>% rle %$% lengths
df %>%
  mutate(na.seq = rle.len %>% seq_along %>% rep(rle.len)) %>%
  group_by(na.seq) %>%
  mutate(distance = ifelse(is.na(x),
                           pmin(rank(na.seq, ties.method = "first"),
                                rank(na.seq, ties.method = "last")),
                           0))
x na.seq distance
1 NA 1 1
2 2 2 0
3 NA 3 1
4 NA 3 2
5 NA 3 1
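The rank-based expression above measures how far each NA is from the nearer end of its run, which equals the distance to the nearest non-NA value when the run has non-NA values on both sides. A base R sketch of the same run-length idea follows. The function name rle_distance is hypothetical, and the special handling of runs at the start or end of the vector is an added assumption, not part of the answer above.

```r
# Hypothetical base R version of the run-length idea.
rle_distance <- function(x) {
  r <- rle(is.na(x))
  n_runs <- length(r$lengths)
  run_id <- rep(seq_len(n_runs), r$lengths)
  # Position of each element within its run, counted from either end
  from_start <- sequence(r$lengths)
  from_end <- rep(r$lengths, r$lengths) - from_start + 1L
  # Assumption: a run at the start has no non-NA neighbour on its left,
  # and a run at the end has none on its right, so ignore that side.
  from_start[run_id == 1L] <- Inf
  from_end[run_id == n_runs] <- Inf
  ifelse(is.na(x), pmin(from_start, from_end), 0)
}

x <- 1:20
x[c(1, 3, 4, 5, 11, 14, 15, 16)] <- NA
rle_distance(x)
```

With this handling, a vector such as c(NA, NA, 3) gives distances 2, 1, 0 instead of 1, 1, 0.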
Upvotes: 2
Reputation: 73325
Let x be your vector containing NA. With
a <- which(!is.na(x))
b <- which(is.na(x))
your task is to find min(abs(a - b[i])) for every b[i].
This type of task is not easy to accomplish efficiently in plain R code. Writing a loop in compiled code is generally a better choice, unless some package already provides a function that does this for us.
Some naive but straightforward solutions are the following.
If x is not too long, we can use outer:
distance <- numeric(length(x))
distance[is.na(x)] <- apply(abs(outer(a, b, "-")), 2L, min)
If it is long and the memory usage of outer becomes a problem, we might do
distance <- numeric(length(x))
distance[is.na(x)] <- sapply(b, function (bi) min(abs(bi - a)))
Note that neither method is truly efficient algorithmically: both compare every NA index against every non-NA index.
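Both snippets above compare each NA index with every non-NA index. One possible way to avoid that while staying in plain R is a forward and a backward pass with cummax and cummin, which touches each element a constant number of times. This is only a sketch: the function name nearest_non_na_linear is hypothetical, and returning Inf when x contains no non-NA value is an assumption.

```r
# Hypothetical two-pass version: nearest non-NA on the left and on the right.
nearest_non_na_linear <- function(x) {
  n <- length(x)
  idx <- seq_len(n)
  ok <- !is.na(x)
  # Index of the latest non-NA at or before each position (0 if none yet)
  prev <- cummax(ifelse(ok, idx, 0L))
  # Index of the earliest non-NA at or after each position (n + 1 if none left)
  nxt <- rev(cummin(rev(ifelse(ok, idx, n + 1L))))
  left <- ifelse(prev == 0L, Inf, idx - prev)
  right <- ifelse(nxt == n + 1L, Inf, nxt - idx)
  pmin(left, right)
}

x <- 1:20
x[c(1, 3, 4, 5, 11, 14, 15, 16)] <- NA
distance <- nearest_non_na_linear(x)
```

For non-NA positions both passes point at the position itself, so the distance is 0 without special-casing.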
Upvotes: 4