fffrost

Reputation: 1777

Create a column of categorised values from a column of integers in R

I'm really new to R but I haven't been able to find a simple solution to this. As an example, I have the following dataframe:

case <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
areas <- c(1,2,1,1,1,2,2,2,2,1,1,2,2,2,1,1,1,2,2,2)
A <- c(1,2,11,12,20,21,26,43,43,47,48,59,63,64,65,66,67,83,90,91)
var <- c(1,1,0,0,0,1,1,0,0,1,0,1,0,1,1,0,0,0,0,0)
outcome <- c(1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0)

df <- data.frame(case,areas,A,var,outcome)

   case areas  A var outcome
1     1     1  1   1       1
2     2     2  2   1       0
3     3     1 11   0       0
4     4     1 12   0       0
5     5     1 20   0       0
6     6     2 21   1       0
7     7     2 26   1       0
8     8     2 43   0       0
9     9     2 43   0       0
10   10     1 47   1       1
11   11     1 48   0       0
12   12     2 59   1       1
13   13     2 63   0       0
14   14     2 64   1       0
15   15     1 65   1       0
16   16     1 66   0       0
17   17     1 67   0       0
18   18     2 83   0       1
19   19     2 90   0       0
20   20     2 91   0       0

The 'A' column holds a wide range of integers, and I'd like to create an extra column that assigns each case to one of the following categories:

<5; 5 - 19; 20 - 49; 50 - 79; 80+

So the first 3 rows of the column should be a string value that says "<5", "<5", "5 - 19"... and so on, and the last value in the column will be "80+".

I could write out something like this, but it seems very sloppy:

A_groups = ifelse(df$A<5, "<5", df$A)
A_groups = ifelse(df$A>4 & df$A<20, "5-19", A_groups)
A_groups = ifelse(df$A>19 & df$A<50, "20-49", A_groups)

What is the best alternative to this?

Upvotes: 0

Views: 50

Answers (2)

Anonymous

Reputation: 532

You're looking for the cut() function. You want to create a factor based on intervals, which is exactly what this function does.

df$new_factor <- cut(df$A, breaks = c(-Inf, 5, 20, 50, 80, Inf),
                 labels = c('<5', '5-19', '20-49', '50-79', '80+'),
                 right = FALSE)

See the help page (?cut) for why I included right = FALSE. To double-check that it does what you want, it's always a good idea to test a few cases you aren't sure about. For example, look at case == 5 (where A is 20, exactly on a break) with and without right = FALSE and see what happens to new_factor.
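As a rough illustration of that check, here is a minimal sketch that runs cut() on values sitting right at the breaks, once with right = FALSE and once with the default. The vector x and the names left_closed and right_closed are made up for this example; the breaks and labels are the ones from the call above.

```r
# Hypothetical test values placed on and just below each break
x <- c(4, 5, 19, 20, 49, 50, 79, 80)
breaks <- c(-Inf, 5, 20, 50, 80, Inf)
labels <- c('<5', '5-19', '20-49', '50-79', '80+')

# right = FALSE: intervals are closed on the left, [a, b),
# so a value equal to a break goes into the upper group
left_closed <- cut(x, breaks = breaks, labels = labels, right = FALSE)

# Default (right = TRUE): intervals are closed on the right, (a, b],
# so a value equal to a break goes into the lower group
right_closed <- cut(x, breaks = breaks, labels = labels)

# Compare the two side by side
data.frame(x, left_closed, right_closed)
```

With the default, a value of 5 would be labelled '<5' and 20 would be labelled '5-19', which is not what the labels say, so right = FALSE is the setting that matches the categories in the question.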

Upvotes: 2

hpesoj626

Reputation: 3619

You can use cut() or findInterval().

breaks = c(0,5,20,50,80,Inf)
labels = c("<5", "5-19", "20-49", "50-79", "80+")

# Using cut()
df$A_groups = cut(df$A, breaks = breaks, right = FALSE, labels = labels)

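# Using findInterval(); levels is set explicitly so the labels still line up
# even if one of the intervals contains no observations
df$B_groups = factor(findInterval(df$A, breaks), levels = seq_along(labels), labels = labels)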
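If you want to convince yourself that the two approaches give the same grouping, something like the sketch below should do. It rebuilds the A vector from the question so it runs on its own; by_cut and by_find are just illustrative names, and passing levels = seq_along(labels) to factor() is a defensive choice assumed here so that an empty interval would not shift the labels.

```r
# A column from the question, recreated so the example is self-contained
A <- c(1, 2, 11, 12, 20, 21, 26, 43, 43, 47, 48, 59, 63, 64, 65, 66, 67, 83, 90, 91)
breaks <- c(0, 5, 20, 50, 80, Inf)
labels <- c("<5", "5-19", "20-49", "50-79", "80+")

# Grouping with cut(), intervals closed on the left
by_cut <- cut(A, breaks = breaks, right = FALSE, labels = labels)

# Grouping with findInterval(), which returns the interval index (1 to 5 here)
by_find <- factor(findInterval(A, breaks), levels = seq_along(labels), labels = labels)

# TRUE if every case lands in the same group under both approaches
all(as.character(by_cut) == as.character(by_find))

# Number of cases per group
table(by_cut)
```

Note that breaks starts at 0 here, so a negative value of A would come back as NA from both approaches; use -Inf as the first break if that can happen in your data.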

Upvotes: 1
