dalitmil

Reputation: 41

How to fix simple code for a conditional mutate

I need to create a new variable based on the values in an existing column of a data frame. If the existing value ends with .0, the value in the new column should be 0; if it ends with .1, the value in the new column should be 1. However, the short piece of code I used does not distinguish between 11.0 and 11.1 (or other pairs with a similar pattern).

Below are an example and my unsuccessful solution:

c<-c(1.1,1.0, 0.1, 0.0, 80.1, 80.0, 91.1, 91.0, 11.1,11.0)
b<-c(1,1,0,0,80,80,91,91,11,11)

cb<-data.frame(b,c) # this is an example of my data

cb<-mutate(cb, a = ifelse(grepl( ".1" ,   cb$c ), 1, 0 )) #this is my unsuccessful solution

a<-c(1,0,1,0,1,0,1,0,1,0)
abc<-data.frame(a,c,b) # This is the desired result

As can be seen from the code, the values 91.0 and 11.0 incorrectly get 1 instead of 0.

Upvotes: 0

Views: 44

Answers (1)

r2evans

Reputation: 160407

Your regex is part of the problem: an unescaped "." matches any character, so ".1" should really be "\\.1" to look for the literal dot.
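To illustrate the pattern issue only, here is a small base R sketch. The vector `x` is a made-up sample (not the asker's full data), and the `$` anchor and the `fixed = TRUE` variant are additions for illustration. Matching text against numeric data is still fragile for the floating-point reasons described next.

```r
# Numeric input is converted to text before matching, so 91.0 becomes "91".
x <- c(91.0, 11.0, 11.1, 80.1)
as.character(x)

# Unescaped dot: "." matches any character, so "91" and "11" also match.
unescaped <- grepl(".1", x)

# Escaped dot, anchored at the end of the string: only a literal ".1" suffix matches.
escaped <- grepl("\\.1$", x)

# fixed = TRUE treats the pattern as a plain string instead of a regex.
literal <- grepl(".1", x, fixed = TRUE)

print(data.frame(x, unescaped, escaped, literal))
```

With the unescaped pattern every element matches, which is the behaviour described in the question.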

Since your data is actually numeric, you really should be testing it numerically. Unfortunately (and this bites you whether you compare numerically or with grepl), your ###.1 might internally be stored as ###.09999999, which will obviously fail. While you could in theory write a regular expression that catches this, that starts getting a bit complicated (https://xkcd.com/1171/). So, you should test it numerically.

But since it is floating point, doing something like

if_else(c %% 1 == 0.1, 1, 0)

can fail for the same reason. A larger example:

seq(0.1, 10.1)
#  [1]  0.1  1.1  2.1  3.1  4.1  5.1  6.1  7.1  8.1  9.1 10.1
seq(0.1, 10.1) %% 1
#  [1] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1

### this is where it gets interesting
(seq(0.1, 10.1) %% 1) == 0.1
#  [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
(seq(0.1, 10.1) %% 1) - 0.1
#  [1]  0.000000e+00  8.326673e-17  8.326673e-17  8.326673e-17 -3.608225e-16 -3.608225e-16 -3.608225e-16 -3.608225e-16
#  [9] -3.608225e-16 -3.608225e-16 -3.608225e-16

In reality, any comparison of numeric (floating-point) values should use tests of inequality rather than equality, for the reasons given in R FAQ 7.31 (and IEEE-754). Long story short: because digital storage has limited precision, you may never get precisely the number you expect when comparing. (You might get it right 99.9% of the time, but the other 0.1% will test incorrectly with no indication to you.)
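The classic illustration of that FAQ entry, as a minimal base R sketch. The variable names and the 1e-8 tolerance are chosen here for illustration only.

```r
sum_val <- 0.1 + 0.2

# Exact equality fails because neither 0.1, 0.2 nor 0.3 is stored exactly.
exact <- sum_val == 0.3

# A test of inequality against a small tolerance succeeds.
close_enough <- abs(sum_val - 0.3) < 1e-8

# Base R's all.equal() performs a tolerance-based comparison as well;
# wrap it in isTRUE() because it returns a message, not FALSE, on mismatch.
base_check <- isTRUE(all.equal(sum_val, 0.3))

print(c(exact = exact, close_enough = close_enough, base_check = base_check))
```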

Consider a test of inequality:

if_else(abs(c %% 1 - 0.1) < 1e-8, 1, 0)
# or, without if_else (multiplying the logical by 1L gives an integer 0/1)
1L * (abs(c %% 1 - 0.1) < 1e-8)

### using the demo from above
abs(seq(0.1, 10.1) %% 1 - 0.1) < 1e-8
#  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

My choice of 1e-8 is arbitrary and merely a starting point, as it depends on the type of data you have. Since your comparison-scale is "0.1", then frankly you could use abs(c %% 1 - 0.1) < 0.01. The smallest you can get with practical use is .Machine$double.eps (see ?.Machine for definition of its properties), though I find in many applications that something orders-of-magnitude larger is still fine.
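To make the tolerance discussion concrete, the sketch below compares two candidate thresholds on the demo sequence from above. The names `eps`, `default_tol`, `err`, `strict` and `loose` are introduced here for illustration. Judging by the differences printed earlier (some around 3.6e-16), `.Machine$double.eps` itself is probably too tight for this data, which is one reason to pick something larger.

```r
eps <- .Machine$double.eps   # 2^-52, roughly 2.2e-16
default_tol <- sqrt(eps)     # roughly 1.5e-8, the default tolerance of all.equal()

x <- seq(0.1, 10.1)
err <- abs(x %% 1 - 0.1)     # distance of each fractional part from 0.1

strict <- err < eps          # may reject values that are "really" ###.1
loose <- err < default_tol   # accepts every element of this sequence

print(data.frame(x, err, strict, loose))
```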

N.B.: generally-speaking, this depends entirely on the domain of the numbers, so please don't blindly use 1e-8 without understanding the premise and consequences of choosing the wrong boundary.
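Applied to the data from the question, the tolerance test might look like the sketch below. It uses base R so it runs without extra packages; the names `c_vals`, `b_vals` and `tol` are hypothetical, and the rounding cross-check is an alternative added here (it assumes the values carry at most one meaningful decimal place), not part of the approach above.

```r
# Data from the question, with the vectors renamed to avoid masking c().
c_vals <- c(1.1, 1.0, 0.1, 0.0, 80.1, 80.0, 91.1, 91.0, 11.1, 11.0)
b_vals <- c(1, 1, 0, 0, 80, 80, 91, 91, 11, 11)
cb <- data.frame(b = b_vals, c = c_vals)

# Tolerance test: 1 when the fractional part is within tol of 0.1, else 0.
tol <- 1e-8
cb$a <- as.integer(abs(cb$c %% 1 - 0.1) < tol)

# With dplyr loaded, the equivalent would be:
#   cb <- mutate(cb, a = as.integer(abs(c %% 1 - 0.1) < tol))

# Cross-check: scale to tenths, round to a whole number, inspect the last digit.
a_rounded <- as.integer(round(cb$c * 10) %% 10 == 1)

print(cb)
```

Both approaches give the alternating 1, 0 pattern shown as the desired result in the question.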

(And I stand by my comment: typing and using a variable named c just ... makes my brain hurt :-)

Upvotes: 2
