Reputation: 11
I'm back to using R after using SAS for a few years, and I'm relearning everything again.
I have a dataset with variable Lot_Size, which contains continuous data from 0.1980028 - 1.2000000 acres. I'd like to categorize this variable based on these demarcations:
0 - 1/3 acre = 0
1/3 - 2/3 acre = 1
2/3 - 1 acre = 2
1+ acre = 3
Into a new variable LS_cat.
I've explored the mutate command, but I keep getting errors. Does anyone have any ideas?
UPDATE
Thanks for responding - both solutions worked perfectly. Since this was a learning experience for me, I'll add to the question.
I actually misunderstood the question posed to me. If I were to make dummy variables for each category previously noted, how would I do that? For example, if Lot_Size is 0 - 1/3 of an acre, I want variable ls_1_3 to be 1; if it's not, I'd like it to be 0. Would I use the ifelse command?
Upvotes: 0
Views: 929
Reputation: 1959
case_when() is usually a sound solution when there are more than two options (if_else() if there are just two), but in this case there's a simpler math(s) solution.
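library(tibble)
my_df <- tibble(lot_size = seq(0, 1.2, by = 0.1))
# Approximate: a value just under a boundary (e.g. 0.332) is rounded up into the next category
my_df$ls_cat <- ceiling((my_df$lot_size * 3) - 0.99)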
That said, this is less instructive as far as R programming goes.
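The 0.99 offset is only approximate: a lot of 0.332 acres ends up in category 1 even though it is below 1/3. Here is a minimal sketch of an exact arithmetic variant, assuming the lower bound of each category is inclusive and everything from 1 acre up is category 3. The name ls_cat_exact is only for illustration.

```r
lot_size <- c(0.2, 0.332, 1/3, 0.5, 2/3, 0.99, 1, 1.2)

# floor(x * 3) counts how many full thirds fit into x;
# pmin() caps the result at 3 for lots of one acre or more
ls_cat_exact <- pmin(floor(lot_size * 3), 3)

data.frame(lot_size, ls_cat_exact)
```

This only works because the categories are equally wide; for uneven breaks, one of the interval-based approaches is the safer choice.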
For your follow-on question, ifelse() works well, e.g.
Base:
my_df$ls_1_3 <- ifelse(my_df$lot_size < 1/3, 1, 0)
Or Tidyverse:
library(dplyr)
my_df <- my_df %>%
  mutate(ls_1_3 = if_else(lot_size < 1/3, 1, 0))
NB: if_else() is a more pedantic (type-strict) version of ifelse(). Both should work equally well here, but if_else() is better for catching possible errors.
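If you want a dummy for every category rather than just the first, the same idea extends. This is a rough sketch in base R, assuming the lower bound of each category is inclusive. Only ls_1_3 comes from the question; the other column names (ls_2_3, ls_3_3, ls_1plus) are made up for illustration.

```r
my_df <- data.frame(lot_size = c(0.2, 0.5, 0.9, 1.2))

# Each comparison returns TRUE/FALSE; as.integer() turns that into 1/0
my_df$ls_1_3   <- as.integer(my_df$lot_size < 1/3)
my_df$ls_2_3   <- as.integer(my_df$lot_size >= 1/3 & my_df$lot_size < 2/3)
my_df$ls_3_3   <- as.integer(my_df$lot_size >= 2/3 & my_df$lot_size < 1)
my_df$ls_1plus <- as.integer(my_df$lot_size >= 1)

my_df
```

Each row should have exactly one dummy equal to 1, which is a handy sanity check on the boundaries.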
Upvotes: 1
Reputation: 2217
Use case_when().
library(tidyverse)
set.seed(123)
my_df <- tibble(
  lot_size = runif(n = 10, min = 0.1980028, max = 1.2)
)
my_df |> mutate(
  ls_cat = case_when(lot_size < 1 / 3 ~ 0,
                     lot_size < 2 / 3 ~ 1,
                     lot_size < 1 ~ 2,
                     TRUE ~ 3)
)
#> A tibble: 10 x 2
#>    lot_size ls_cat
#>       <dbl>  <dbl>
#>  1    0.486      1
#>  2    0.988      2
#>  3    0.608      1
#>  4    1.08       3
#>  5    1.14       3
#>  6    0.244      0
#>  7    0.727      2
#>  8    1.09       3
#>  9    0.751      2
#> 10    0.656      1
Upvotes: 1
Reputation: 72803
cut it.
dat <- transform(dat, Lot_Size_cat=
                   cut(Lot_Size, breaks=c(0, 1/3, 2/3, 1, Inf), labels=0:3,
                       include.lowest=TRUE))
dat
#            X1  Lot_Size Lot_Size_cat
# 1  0.77436849 1.0509024            3
# 2  0.19722419 0.2819626            0
# 3  0.97801384 0.8002238            2
# 4  0.20132735 0.9272001            2
# 5  0.36124443 0.6396998            1
# 6  0.74261194 1.0990851            3
# 7  0.97872844 1.1648617            3
# 8  0.49811371 0.7221819            2
# 9  0.01331584 1.1915689            3
# 10 0.25994613 0.4076475            1
Data:
set.seed(666)
n <- 10
dat <- data.frame(X1=runif(n),
                  Lot_Size=sample(seq(0.1980028, 1.2, 1e-7), n, replace=TRUE))
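One thing to watch with cut: by default its intervals are closed on the right, so a lot of exactly 1/3 acre lands in category 0 and a lot of exactly 1 acre in category 2. If you would rather have the lower bound inclusive, right=FALSE should do it. A small sketch with made-up values to show the difference at the boundaries:

```r
x <- c(0.2, 1/3, 0.5, 2/3, 1, 1.2)

# Default: intervals closed on the right, (0, 1/3], (1/3, 2/3], ...
right_closed <- cut(x, breaks = c(0, 1/3, 2/3, 1, Inf), labels = 0:3,
                    include.lowest = TRUE)

# right = FALSE: intervals closed on the left, [0, 1/3), [1/3, 2/3), ...
left_closed <- cut(x, breaks = c(0, 1/3, 2/3, 1, Inf), labels = 0:3,
                   right = FALSE)

data.frame(x, right_closed, left_closed)
```

Either way the result is a factor; as.integer(as.character(...)) converts it back to numbers if you need them.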
Upvotes: 0
Reputation: 160437
We can use findInterval:
Lot_Size <- seq(0.2, 1.2, len=10)
Lot_Size
# [1] 0.2000000 0.3111111 0.4222222 0.5333333 0.6444444 0.7555556 0.8666667 0.9777778 1.0888889 1.2000000
findInterval(Lot_Size, c(0, 1/3, 2/3, 1, Inf), rightmost.closed = TRUE) - 1L
# [1] 0 0 1 1 1 2 2 2 3 3
Here findInterval returns the (1-based) index of the interval within the breaks vector that each value falls into, which we then convert to your 0-based categories with the trailing - 1L (integer 1).
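To get this into the LS_cat column from the question, something along these lines should work. It is a sketch with a made-up data frame; note that findInterval treats intervals as closed on the left by default, so a lot of exactly 1/3 acre falls into category 1.

```r
dat <- data.frame(Lot_Size = c(0.1980028, 1/3, 0.5, 0.75, 1, 1.2))

# findInterval() uses left-closed intervals [lower, upper);
# subtracting 1L shifts the 1-based interval index to 0-based categories
dat$LS_cat <- findInterval(dat$Lot_Size, c(0, 1/3, 2/3, 1, Inf)) - 1L

dat
```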
Upvotes: 0