Reputation: 1777
I'm really new to R but I haven't been able to find a simple solution to this. As an example, I have the following dataframe:
case <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
areas <- c(1,2,1,1,1,2,2,2,2,1,1,2,2,2,1,1,1,2,2,2)
A <- c(1,2,11,12,20,21,26,43,43,47,48,59,63,64,65,66,67,83,90,91)
var <- c(1,1,0,0,0,1,1,0,0,1,0,1,0,1,1,0,0,0,0,0)
outcome <- c(1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0)
df <- data.frame(case,areas,A,var,outcome)
   case areas  A var outcome
1     1     1  1   1       1
2     2     2  2   1       0
3     3     1 11   0       0
4     4     1 12   0       0
5     5     1 20   0       0
6     6     2 21   1       0
7     7     2 26   1       0
8     8     2 43   0       0
9     9     2 43   0       0
10   10     1 47   1       1
11   11     1 48   0       0
12   12     2 59   1       1
13   13     2 63   0       0
14   14     2 64   1       0
15   15     1 65   1       0
16   16     1 66   0       0
17   17     1 67   0       0
18   18     2 83   0       1
19   19     2 90   0       0
20   20     2 91   0       0
In the 'A' column we have a wide range of integers, and I'd like to create an extra column that assigns each case to one of the following categories:
<5; 5 - 19; 20 - 49; 50 - 79; 80+
So the first 3 rows of the column should be a string value that says "<5", "<5", "5 - 19"... and so on, and the last value in the column will be "80+".
I could write out something like this, but it seems very sloppy:
A_groups = ifelse(df$A<5, "<5", df$A)
A_groups = ifelse(df$A>4 & df$A<20, "5-19", A_groups)
A_groups = ifelse(df$A>19 & df$A<50, "20-49", A_groups)
What is the best alternative to this?
Upvotes: 0
Views: 50
Reputation: 532
You're looking for the cut() function. You want to create a factor based on intervals, which is what this function provides.
df$new_factor <- cut(df$A, breaks = c(-Inf, 5, 20, 50, 80, Inf),
labels = c('<5', '5-19', '20-49', '50-79', '80+'),
right = FALSE)
See the help page (?cut) for why I included right = FALSE: it makes each interval closed on the left and open on the right, so a value of 5 falls in '5-19' rather than '<5'. To double-check that the result is what you intend, it's always worth testing a few cases you aren't sure about. For example, look at case == 5 (where A is 20) with right = FALSE and without it, and see what happens to new_factor.
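The boundary check suggested above can be sketched roughly like this. The vector `A_edge` and the names `left_closed` and `right_closed` are made up for illustration; they are values sitting on or next to each break, not part of the original data.

```r
# Illustrative values on or next to each break (not from the question's data)
A_edge <- c(4, 5, 19, 20, 49, 50, 79, 80)

breaks <- c(-Inf, 5, 20, 50, 80, Inf)
labels <- c("<5", "5-19", "20-49", "50-79", "80+")

# right = FALSE: intervals are [a, b), so 5 goes to "5-19" and 20 to "20-49"
left_closed <- cut(A_edge, breaks = breaks, labels = labels, right = FALSE)

# Default (right = TRUE): intervals are (a, b], so 5 stays in "<5" and 20 in "5-19"
right_closed <- cut(A_edge, breaks = breaks, labels = labels)

# Side-by-side comparison; rows where the two columns differ are the break values
data.frame(A_edge, left_closed, right_closed)
```

Only the values exactly equal to a break (5, 20, 50, 80) change group between the two calls, which is why those are the cases worth checking by hand.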
Upvotes: 2
Reputation: 3619
You can use cut() or findInterval().
breaks = c(0,5,20,50,80,Inf)
labels = c("<5", "5-19", "20-49", "50-79", "80+")
# Using cut()
df$A_groups = cut(df$A, breaks = breaks, right = FALSE, labels = labels)
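# Using findInterval()
# levels = seq_along(labels) keeps the labels aligned even if an interval has no cases
df$B_groups = factor(findInterval(df$A, breaks), levels = seq_along(labels), labels = labels)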
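One caveat with the findInterval() route, as far as I can tell: factor(x, labels = labels) on its own expects every interval to occur at least once, otherwise the number of labels no longer matches the number of distinct values. A minimal sketch of a safer variant, using a made-up vector `A_small` that has no value of 80 or more:

```r
breaks <- c(0, 5, 20, 50, 80, Inf)
labels <- c("<5", "5-19", "20-49", "50-79", "80+")

# Hypothetical data in which the "80+" interval is empty
A_small <- c(1, 11, 20, 59)

# Index of the interval each value falls into (left-closed by default)
idx <- findInterval(A_small, breaks)

# Passing levels explicitly keeps all five labels, including the unused "80+"
groups <- factor(idx, levels = seq_along(labels), labels = labels)

# Compare against cut() with left-closed intervals
same <- all(as.character(groups) ==
            as.character(cut(A_small, breaks = breaks, right = FALSE, labels = labels)))
```

With the levels given explicitly, both approaches should assign the same group to every value, and the empty "80+" level is still present in the factor.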
Upvotes: 1