Reputation: 35

Replace values in one column based on a vector conditionally matching another column

I have the following data frame and I want to replace the reflectance values with NA depending on whether or not a wavelength value falls in a certain grouping of ranges that were determined to be bad measurements (badData vector).

The ranges of bad data might change over time so I would like the solution to be as general as possible.

  badData <- c(296:310, 330:335, 350:565)

  df <- data.frame(wavelength = seq(300,360,5.008667),
                  reflectance = seq(-1,-61,-5.008667))

df 

   wavelength reflectance
   300.0000   -1.000000
   305.0087   -6.008667
   310.0173  -11.017334
   315.0260  -16.026001
   320.0347  -21.034668
   325.0433  -26.043335
   330.0520  -31.052002
   335.0607  -36.060669
   340.0693  -41.069336
   345.0780  -46.078003
   350.0867  -51.086670
   355.0953  -56.095337

I have tried

   Data2 <- df %>% 
  mutate(reflectance = replace(reflectance,wavelength %in% badData, NA))

But because I am trying to do this with wavelength ranges rather than exact values this will not work. I am thinking I should use a conditional statement, but I do not know how to feed a vector with different groupings of ranges through that most efficiently.

The output dataset would be because wavelengths 300.000 and 305.0087 fall between 296 and 310, wavelength 330.05620 is between 330 and 335 and 350.0867 and 355.0953 fall between 350:565.

 wavelength reflectance
   300.0000   NA
   305.0087   NA
   310.0173  -11.017334
   315.0260  -16.026001
   320.0347  -21.034668
   325.0433  -26.043335
   330.0520  NA
   335.0607  -36.060669
   340.0693  -41.069336
   345.0780  -46.078003
   350.0867  NA
   355.0953  NA

Upvotes: 1

Answers (4)

r2evans

Reputation: 161110

The first step is to realize that defining ranges of integers will not work. Instead, I'll go with a list of number pairs:

badData <- list(c(296,310), c(330,335), c(350,565))

with the understanding that we want to check each $wavelength to be within any of these three ranges. More ranges are supported.

The second thing we can do is write a function that checks if a vector of values is within one or more pairs of numbers. (In this example, we "know" that it will not be in more than one, but that's not critical.)

within_ranges <- function(x, lims)  {
  Reduce(`|`, lapply(lims, function(lim) lim[1] <= x & x <= lim[2]))
}

To understand what this is doing, let's debug it, call it, and see what's going on.

debugonce(within_ranges)
within_ranges(df$wavelength, badData)
# debugging in: within_ranges(df$wavelength, badData)
# debug at #1: {
#     Reduce(`|`, lapply(lims, function(lim) lim[1] <= x & x <= 
#         lim[2]))
# }

Let's just run that inner portion:

# Browse[2]> 
lapply(lims, function(lim) lim[1] <= x & x <= lim[2])
# [[1]]
#  [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [[2]]
#  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
# [[3]]
#  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE

So the first element (T,T,F,F,...) is whether the values (x) fall within the first number pair (296 to 310); the second element with the second pair (330 to 335); etc.

The Reduce( part calls the first argument, a function, on the first two arguments, saves the return, and then runs the same function on the return and the third argument. It stores it, then runs the same function on the return and fourth argument (if exists). It repeats this along the entire length of the provided list.

In this example, the function is the literal | (escaped since it is special), so it is "OR"ing the [[1]] vector with the [[2]] vector. You can actually see what is happening if you add accumulate=TRUE:

# Browse[2]> 
Reduce(`|`, lapply(lims, function(lim) lim[1] <= x & x <= lim[2]), accumulate=TRUE)
# [[1]]
#  [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [[2]]
#  [1]  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
# [[3]]
#  [1]  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE

The first return is the first vector, unmodified. The second element is the original [[2]] vector ORed with the previous return which is this [[1]] vector (which is the same as the original [[1]]). The third element is the original [[3]] vector ORed with the previous return, which is this [[2]]. This results in the three groupings of TRUE (1, 2, 7, 11, 12) that you are expecting. So we want the [[3]] element, which is what we get without accumulating:

# Browse[2]> 
Reduce(`|`, lapply(lims, function(lim) lim[1] <= x & x <= lim[2]))
#  [1]  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE

Okay, so let's Quit out of the debugger, and give it a full go:

within_ranges(df$wavelength, badData)
#  [1]  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE

This output looks familiar.

(BTW: inside our function, we could also have used
rowSums(sapply(lims, ...)) > 0
and it would have worked just as well. For that, though, you need to realize that sapply should be returning a matrix with as many columns as we have rows of data in df, odd if you aren't familiar.)

Now, we can NAify what we need to either with dplyr:

df %>%
  mutate(
    reflectance = if_else(within_ranges(wavelength, badData), NA_real_, reflectance)
  )
#    wavelength reflectance
# 1    300.0000          NA
# 2    305.0087          NA
# 3    310.0173   -11.01733
# 4    315.0260   -16.02600
# 5    320.0347   -21.03467
# 6    325.0433   -26.04333
# 7    330.0520          NA
# 8    335.0607   -36.06067
# 9    340.0693   -41.06934
# 10   345.0780   -46.07800
# 11   350.0867          NA
# 12   355.0953          NA

Edit: or another dplyr, using your first thought of replace (not my first by habit, no reason):

df %>%
  mutate(
    reflectance = replace(reflectance, within_ranges(wavelength, badData), NA_real_)
  )

or base R:

df$reflectance <- ifelse(within_ranges(df$wavelength, badData), NA_real_, df$reflectance)
df
#    wavelength reflectance
# 1    300.0000          NA
# 2    305.0087          NA
# 3    310.0173   -11.01733
# 4    315.0260   -16.02600
# 5    320.0347   -21.03467
# 6    325.0433   -26.04333
# 7    330.0520          NA
# 8    335.0607   -36.06067
# 9    340.0693   -41.06934
# 10   345.0780   -46.07800
# 11   350.0867          NA
# 12   355.0953          NA

Notes:

I'm specifically using NA_real_, both for clarity (did you know there are different types of NA?), and partly because in the use of dplyr::if_else, it will complain/fail if the classes of the "true" and "false" arguments are not the same (NA is technically logical, not numeric as your reflectance is);
I use dplyr::if_else for the first example, since you're already using dplyr, but in case you choose to forego dplyr (or somebody else does), then the base-R ifelse works, too. (It has its liabilities, but it appears to work just fine here.)

Upvotes: 7

piptoma

Reputation: 796

How about dplyr::between()?

library(dplyr)

df %>% 
  mutate(
    reflectance = case_when(
      between(wavelength, 296, 310) ~ NA_real_,
      between(wavelength, 330, 335) ~ NA_real_,
      between(wavelength, 350, 565) ~ NA_real_,
      TRUE                          ~ reflectance
    )
  )

Upvotes: 1

A. Suliman

Reputation: 13135

Here is a solution based on create a dataframe for badData and tidyr::crossing. Using crossing we can get all combinations between the two dataframes.

badData <- data.frame(start= c(296,330,350),end=c(310.01,335,565))

library(dplyr)
library(tidyr)
library(data.table)

df %>% crossing(badData) %>% 
       mutate(Flag=ifelse(data.table::between(wavelength,start,end),1,0)) %>% 
       arrange(wavelength,desc(Flag)) %>% #Make sure 1 'if exist' at the 1st row for each wavelength before run distinct
       distinct(wavelength,.keep_all=T) %>%
       mutate(reflectance_upd=ifelse(Flag==1,NA,reflectance))

    wavelength reflectance start    end Flag reflectance_upd
1    300.0000   -1.000000   296 310.01    1              NA
2    305.0087   -6.008667   296 310.01    1              NA
3    310.0173  -11.017334   296 310.01    0       -11.01733
4    315.0260  -16.026001   296 310.01    0       -16.02600
5    320.0347  -21.034668   296 310.01    0       -21.03467
6    325.0433  -26.043335   296 310.01    0       -26.04333
7    330.0520  -31.052002   330 335.00    1              NA
8    335.0607  -36.060669   296 310.01    0       -36.06067
9    340.0693  -41.069336   296 310.01    0       -41.06934
10   345.0780  -46.078003   296 310.01    0       -46.07800
11   350.0867  -51.086670   350 565.00    1              NA
12   355.0953  -56.095337   350 565.00    1              NA

Upvotes: 0

msr_003

Reputation: 1233

I think this will help.

    library(TeachingDemos)
    df$reflectance <- ifelse(296 %<% df$wavelength %<% 310 | 330 %<% df$wavelength %<% 335 | 350 %<% df$wavelength %<% 565, NA, df$reflectance) 

> df
   wavelength reflectance
1    300.0000          NA
2    305.0087          NA
3    310.0173   -11.01733
4    315.0260   -16.02600
5    320.0347   -21.03467
6    325.0433   -26.04333
7    330.0520          NA
8    335.0607   -36.06067
9    340.0693   -41.06934
10   345.0780   -46.07800
11   350.0867          NA
12   355.0953          NA

Upvotes: 0

Replace values in one column based on a vector conditionally matching another column

Answers (4)

Related Questions