Reputation: 1247

Distance from the closest non NA value in a dataframe

I have the following data frame df, and I want to add a column giving, for each row, the distance to the closest non-NA value.

df <- data.frame(x = 1:20)
df[c(1, 3, 4, 5, 11, 14, 15, 16), "x"] <-  NA

In other words, I am looking for the following values:

df$distance <- c(1, 0, 1, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 1, 0, 0, 0, 0)

How can I do this automatically?

Upvotes: 3

Answers (4)

GMSL

Reputation: 425

One method is to convert your matrix into a RasterLayer object with raster() from the raster package and then call distance() on it.

The package is meant for maps, so the object created by raster() has units, a resolution, and so on. As a result, distance() may return a very large value for an element that is one cell away from a non-NA (15796.35 for me). Just divide by this amount (and maybe round() to absorb rounding errors) to get your answer.

As an example, if I have an array object with NAs called a1:

> library(raster)
> a1 = array(
    c(
       c(1, 5, 6, NA, 1, 2, 5),
       c(3, 4, NA, NA, NA, 8, 1),
       c(5, 1, 7, NA, 2, 3, 7),
       c(8, 1, 1, 2, 3, 6, 2)
     ),
    c(7, 4)
  )
> r1 = raster(a1)
> d1 = distance(r1)
> as.matrix(d1)    

         [,1]     [,2]     [,3] [,4]
[1,]     0.00     0.00     0.00    0
[2,]     0.00     0.00     0.00    0
[3,]     0.00 15796.35     0.00    0
[4,] 15796.33 31592.66 15796.33    0
[5,]     0.00 15796.33     0.00    0
[6,]     0.00     0.00     0.00    0
[7,]     0.00     0.00     0.00    0

> round(
     as.matrix(d1) / 15796.35,
     0
  )

     [,1] [,2] [,3] [,4]
[1,]    0    0    0    0
[2,]    0    0    0    0
[3,]    0    1    0    0
[4,]    1    2    1    0
[5,]    0    1    0    0
[6,]    0    0    0    0
[7,]    0    0    0    0

This is the answer you are looking for. I don't know how efficient the code behind the distance() function is, though, so it may be slow on large inputs.

EDIT: I tested it on an array object with 29000 NAs and it takes a long time. I advise using this only for objects with few NAs.

Upvotes: 0

Henrik

Reputation: 67778

You can use findInterval. First, find indices of NA and non-NA values, and initialize a distance column:

na <- which(is.na(df$x))
non_na <- which(!is.na(df$x))
df$distance2 <- 0

Then, use findInterval with midpoints of non-NA indices as breaks to find which interval NA indices fall in. Use the intervals to extract corresponding non-NA indices, calculate absolute difference to NA indices, and assign these at NA indices:

df$distance2[na] <- abs(na - non_na[findInterval(na, (non_na[-length(non_na)] + non_na[-1]) / 2) + 1])

df
#     x distance distance2
# 1  NA        1         1
# 2   2        0         0
# 3  NA        1         1
# 4  NA        2         2
# 5  NA        1         1
# 6   6        0         0
# 7   7        0         0
# 8   8        0         0
# 9   9        0         0
# 10 10        0         0
# 11 NA        1         1
# 12 12        0         0
# 13 13        0         0
# 14 NA        1         1
# 15 NA        2         2
# 16 NA        1         1
# 17 17        0         0
# 18 18        0         0
# 19 19        0         0
# 20 20        0         0
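
To see why the midpoints work, it can help to look at the intermediate values for the example data. The following is only an illustrative sketch: the names breaks, nearest and steps are introduced here and are not part of the answer above.

```r
# Rebuild the example data from the question
df <- data.frame(x = 1:20)
df[c(1, 3, 4, 5, 11, 14, 15, 16), "x"] <- NA

na <- which(is.na(df$x))
non_na <- which(!is.na(df$x))

# Midpoints between consecutive non-NA indices serve as the breaks
breaks <- (non_na[-length(non_na)] + non_na[-1]) / 2

# findInterval() returns 0 for indices below the first break, so adding 1
# turns the interval number into a position in non_na
nearest <- non_na[findInterval(na, breaks) + 1]

# One row per NA index: the index itself, its nearest non-NA index, the distance
steps <- data.frame(na = na, nearest = nearest, distance = abs(na - nearest))
print(steps)
```

An NA index that lies exactly on a midpoint (row 4 here, halfway between 2 and 6) is assigned to the non-NA index on its right, which does not change the distance.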

Upvotes: 1

Vlad C.

Reputation: 974

Here is another approach using rle and rank:

library(dplyr)
library(magrittr)

df <- data.frame(x=seq(1, 20))
df[c("1", "3", "4", "5", "11", "14", "15", "16"), 1] <-  NA

rle.len <- df$x %>% is.na %>% rle %$% lengths

df %>% 
  mutate(na.seq=rle.len %>% seq_along %>% rep(rle.len)) %>% 
  group_by(na.seq) %>%
  mutate(distance=ifelse(is.na(x), pmin(rank(na.seq, ties.method = "first"),
                                        rank(na.seq, ties.method = "last")), 0))

    x na.seq distance
1  NA      1        1
2   2      2        0
3  NA      3        1
4  NA      3        2
5  NA      3        1

Upvotes: 2

Zheyuan Li

Reputation: 73325

Let x be your vector containing NA. With

a <- which(!is.na(x))
b <- which(is.na(x))

your question is to find min(abs(a - b[i])) for every b[i].

This type of task is not easy to accomplish efficiently with R code. Writing a loop in compiled code is generally a better choice, unless some function from some package already does this for us.

Two naive but straightforward solutions follow.

If x is not too long, we can use outer:

distance <- numeric(length(x))
distance[is.na(x)] <- apply(abs(outer(a, b, "-")), 2L, min)

If x is long and the memory usage of outer becomes a problem, we can use sapply instead:

distance <- numeric(length(x))
distance[is.na(x)] <- sapply(b, function (bi) min(abs(bi - a)))

Note that neither of these methods is truly efficient from an algorithmic point of view: both compare every NA index with every non-NA index.
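
For comparison, a linear-time alternative can be sketched in vectorised base R with two cumulative passes, one from the left and one from the right. This is a different technique from the two snippets above, offered only as a sketch; the function name nearest_non_na_distance is made up for illustration.

```r
nearest_non_na_distance <- function(x) {
  i <- seq_along(x)
  ok <- !is.na(x)

  # Index of the last non-NA at or before each position (0 if there is none)
  prev_ok <- cummax(ifelse(ok, i, 0L))
  # Index of the next non-NA at or after each position (Inf if there is none)
  next_ok <- rev(cummin(rev(ifelse(ok, i, Inf))))

  # Distance to the nearest non-NA on each side; Inf when that side has none
  left <- ifelse(prev_ok == 0, Inf, i - prev_ok)
  right <- next_ok - i

  pmin(left, right)
}

x <- 1:20
x[c(1, 3, 4, 5, 11, 14, 15, 16)] <- NA
distance <- nearest_non_na_distance(x)
distance
```

Because each side is tracked separately, runs of NAs at the start or end of the vector are handled as well. If x contains no non-NA value at all, every distance comes out as Inf.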

Upvotes: 4
