HanSwet

Reputation: 11

How do I create a conditional variable based on another variable in R?

I'm back to using R after a few years of SAS, and I'm relearning everything.

I have a dataset with a variable, Lot_Size, which contains continuous values from 0.1980028 to 1.2000000 acres. I'd like to categorize this variable using these cutoffs:

0 - 1/3 acre = 0

1/3 - 2/3 acre = 1

2/3 - 1 acre = 2

1+ acre = 3

The result should go into a new variable, LS_cat.

I've explored the mutate command, but I keep getting errors. Does anyone have any ideas?

UPDATE

Thanks for responding - both solutions worked perfectly. Since this was a learning experience for me, I'll add to the question.

I actually misunderstood the question posed to me. If I were to make dummy variables for each of the categories noted above, how would I do that? For example, if Lot_Size is 0 - 1/3 of an acre, I want the variable ls_1_3 to be 1, and otherwise 0. Would I use the ifelse command?

Upvotes: 0

Views: 929

Answers (4)

Tech Commodities

Reputation: 1959

case_when() is usually a sound solution when there are more than two options (if_else() if there are just two), but in this case there's a simpler math(s) solution.

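library(tidyverse)
my_df <- tibble(lot_size = seq(0, 1.2, by = 0.1))
# floor(lot_size * 3) counts completed thirds of an acre; pmin() caps the result at 3
my_df$ls_cat <- pmin(floor(my_df$lot_size * 3), 3)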

Though, this may be less instructive on R programming.

For your follow-up question, ifelse() works well, e.g.

Base:

my_df$ls_1_3 <- ifelse(my_df$lot_size < 1/3, 1, 0)

Or Tidyverse:

my_df <- my_df %>% 
  mutate(ls_1_3 = if_else(lot_size < 1/3, 1, 0))

NB: if_else() is a stricter version of ifelse(). Both should work equally well here, but if_else() is better at catching possible errors, because it checks that both outcomes have compatible types.
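
If you need a dummy for every category rather than just the first, the same comparison idea extends naturally. Here is a minimal base R sketch, assuming made-up lot sizes; only ls_1_3 is a name from the question, and the other three column names (ls_2_3, ls_1, ls_1_plus) are hypothetical:

```r
# Made-up sample values, one per category (not the asker's data)
my_df <- data.frame(lot_size = c(0.2, 0.5, 0.8, 1.1))

# A logical comparison converted with as.integer() gives 1 for TRUE and 0 for FALSE
my_df$ls_1_3    <- as.integer(my_df$lot_size < 1/3)
my_df$ls_2_3    <- as.integer(my_df$lot_size >= 1/3 & my_df$lot_size < 2/3)
my_df$ls_1      <- as.integer(my_df$lot_size >= 2/3 & my_df$lot_size < 1)
my_df$ls_1_plus <- as.integer(my_df$lot_size >= 1)

my_df
```

Each row should end up with exactly one dummy equal to 1, which is a quick sanity check that the cutoffs neither overlap nor leave gaps.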

Upvotes: 1

Mwavu

Reputation: 2217

Use case_when().

library(tidyverse)

set.seed(123)
my_df <- tibble(
  lot_size = runif(n = 10, min = 0.1980028, max = 1.2)
)


my_df |> mutate(
  ls_cat = case_when(lot_size < 1 / 3 ~ 0, 
                     lot_size < 2 / 3 ~ 1, 
                     lot_size < 1 ~ 2, 
                     TRUE ~ 3)
)
#> A tibble: 10 x 2
#>   lot_size ls_cat
#>      <dbl>  <dbl>
#> 1    0.486      1
#> 2    0.988      2
#> 3    0.608      1
#> 4    1.08       3
#> 5    1.14       3
#> 6    0.244      0
#> 7    0.727      2
#> 8    1.09       3
#> 9    0.751      2
#>10    0.656      1
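
One thing worth knowing: case_when() checks the conditions from top to bottom and uses the first one that matches, which is why the thresholds above are listed in ascending order. A small sketch with made-up values (the names ordered and misordered are just for illustration):

```r
library(dplyr)

# Made-up values, one per category
x <- c(0.2, 0.5, 0.8, 1.1)

# Ascending thresholds: each value stops at the first condition it satisfies
ordered <- case_when(x < 1/3 ~ 0, x < 2/3 ~ 1, x < 1 ~ 2, TRUE ~ 3)
ordered
#> [1] 0 1 2 3

# Broadest condition first: it captures every value below 1,
# so the two narrower conditions are never reached
misordered <- case_when(x < 1 ~ 2, x < 2/3 ~ 1, x < 1/3 ~ 0, TRUE ~ 3)
misordered
#> [1] 2 2 2 3
```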

Upvotes: 1

jay.sf

Reputation: 72803

cut it.

dat <- transform(dat, Lot_Size_cat=
                   cut(Lot_Size, breaks=c(0, 1/3, 2/3, 1, Inf), labels=0:3,
                       include.lowest=TRUE))
dat
#            X1  Lot_Size Lot_Size_cat
# 1  0.77436849 1.0509024            3
# 2  0.19722419 0.2819626            0
# 3  0.97801384 0.8002238            2
# 4  0.20132735 0.9272001            2
# 5  0.36124443 0.6396998            1
# 6  0.74261194 1.0990851            3
# 7  0.97872844 1.1648617            3
# 8  0.49811371 0.7221819            2
# 9  0.01331584 1.1915689            3
# 10 0.25994613 0.4076475            1

Data:

set.seed(666)
n <- 10
dat <- data.frame(X1=runif(n),
                  Lot_Size=sample(seq(0.1980028, 1.2, 1e-7), n, replace=TRUE))
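
Two details may matter depending on your data. By default cut() closes each interval on the right, so a lot of exactly 1 acre falls in category 2; passing right=FALSE closes intervals on the left instead. The result is also a factor, so if you need numbers, convert via as.character() first. A small sketch with made-up boundary values (x, right_closed and right_open are illustrative names):

```r
# Made-up values sitting exactly on the cut points
x <- c(1/3, 2/3, 1)
breaks <- c(0, 1/3, 2/3, 1, Inf)

# Default: intervals are (lower, upper], so boundary values stay in the lower category
right_closed <- cut(x, breaks=breaks, labels=0:3, include.lowest=TRUE)
as.integer(as.character(right_closed))
# [1] 0 1 2

# right=FALSE: intervals are [lower, upper), so boundary values move up a category
right_open <- cut(x, breaks=breaks, labels=0:3, right=FALSE, include.lowest=TRUE)
as.integer(as.character(right_open))
# [1] 1 2 3
```

Calling as.integer() directly on the factor would return the internal level codes (1 to 4) rather than the labels 0 to 3, which is why the as.character() step is there.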

Upvotes: 0

r2evans

Reputation: 160437

We can use findInterval:

Lot_Size <- seq(0.2, 1.2, len=10)
Lot_Size
#  [1] 0.2000000 0.3111111 0.4222222 0.5333333 0.6444444 0.7555556 0.8666667 0.9777778 1.0888889 1.2000000
findInterval(Lot_Size, c(0, 1/3, 2/3, 1, Inf), rightmost.closed = TRUE) - 1L
#  [1] 0 0 1 1 1 2 2 2 3 3

findInterval returns the 1-based index of the interval each value falls into, which we then convert to your 0-based categories with the trailing - 1L (integer 1).
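
To see how values that sit exactly on a cutoff are handled, here is a small sketch with made-up values (breaks and LS_cat are just illustrative names). Each interval is closed on the left, so a lot of exactly 1 acre lands in category 3, consistent with "1+ acre = 3" in the question:

```r
# Made-up values that include the exact cut points
Lot_Size <- c(0, 0.2, 1/3, 2/3, 1, 1.2)
breaks <- c(0, 1/3, 2/3, 1, Inf)

# Intervals are [lower, upper), so a value equal to a break moves up a category
LS_cat <- findInterval(Lot_Size, breaks, rightmost.closed = TRUE) - 1L
LS_cat
#  [1] 0 0 1 2 3 3
```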

Upvotes: 0
