user3375672

Reputation: 3768

R - Assign a value/factor in a data.frame to column conditioned on value(s) of other columns

set.seed(8)
df <- data.frame(n = rnorm(5,1), m = rnorm(5,0), l = factor(LETTERS[1:5]))

How can I add a new column to df whose value depends on the values, or a combination of values, of n, m and l? For instance, create a column level and assign it low, medium or high based on the values of both n and m (pseudo-code):

df$level <- ifelse(df$n < 1 & df$m < 1, "low", ifelse(df$n > 1 & df$m > 1, "high", "medium")

This should give:

df$level

#low medium low low medium 

Or if I would like to assign a value to level based on the l column and a value in n (again, pseudo-code):

df$level <- ifelse(df$n < 1 & df$l == c("A", "B"), "low A/B", "high")

In this case one should get:

df$level

#"low A/B" "high" "high" "high" "high"

Upvotes: 1

Views: 16905

Answers (4)

akrun

Reputation: 887048

You could also do:

 c("high", "medium", "low")[rowSums(df[,-3] <1)+1]
#[1] "low"    "medium" "low"    "low"    "medium"

c("high", "low A/B")[(df$n <1 &grepl("A|B", df$l)) +1]
#[1] "low A/B" "high"    "high"    "high"    "high"   

Explanation

  • df[,-3] drops the third column, leaving the numeric columns n and m.
  • df[,-3] <1 gives a logical matrix that is TRUE where an element is <1 and FALSE otherwise.
  • rowSums on the above counts the TRUE values in each row, giving 0, 1 or 2 according to whether neither value, exactly one value, or both values are <1.

    rowSums(df[,-3] <1) # in this example, no row sum equals 0
    #[1] 2 1 2 2 1
    
  • Adding 1 shifts these counts to the valid 1-based indices 1, 2 and 3:

    rowSums(df[,-3] <1) +1
    #[1] 3 2 3 3 2
    
  • Using the above as numeric index, we can do:

      c("high", "medium", "low")[rowSums(df[,-3] <1)+1]
      #[1] "low"    "medium" "low"    "low"    "medium"
    
  • low fills the positions where the index is 3 and medium those where it is 2; had any index been 1, high would fill that position.
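The explanation above covers the first expression; the second one appears to rely on the same idea with a single logical condition. A minimal sketch, assuming the data frame holds the values printed elsewhere in this thread (hard-coded here rather than regenerated with set.seed); the names cond, idx and level are only illustrative:

```r
# Values copied from the data frame printed in this thread (assumption:
# hard-coded rather than regenerated with set.seed(8)).
df <- data.frame(
  n = c(0.9154139, 1.8404001, 0.5365172, 0.4491650, 1.7360404),
  m = c(-0.1078814, -0.1702891, -1.0883317, -3.0110517, -0.5931743),
  l = factor(LETTERS[1:5])
)

# TRUE where n < 1 and the label contains "A" or "B"
cond <- df$n < 1 & grepl("A|B", df$l)

# Arithmetic coerces FALSE/TRUE to 0/1, so adding 1 yields the indices 1/2
idx <- cond + 1L

# Index 1 picks "high", index 2 picks "low A/B"
level <- c("high", "low A/B")[idx]
level
```

Note that grepl("A|B", ...) matches any label that contains A or B; if exact matches are wanted, df$l %in% c("A", "B") is the stricter test.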

Upvotes: 2

Brandon Bertelsen

Reputation: 44638

More of an extended comment than an answer, and perhaps not exactly what you're looking for.

Usually, when I need to capture groups of continuous variables and convert them to a single categorical variable, I use clustering and title the clusters according to the values presented. Here's an example using kmeans:

set.seed(8)
df <- data.frame(n = rnorm(5000,1), m = rnorm(5000,0), l = factor(LETTERS[1:5]))
fit <- kmeans(df[1:2],7)   # fit once so cluster numbers match the printed means
df$Category <- fit$cluster

fit
K-means clustering with 7 clusters of sizes 593, 606, 649, 626, 641, 1219, 666

Cluster means:
           n           m
1 -0.2097451  0.84837728 # Low-High
2  1.0977826  1.44383531 # Mid-Upper
3  2.1682482 -0.70983193 # High-Low
4 -0.3389432 -0.54514302 # Low-Low
5  2.3332772  0.67415808 # High-Mid
6  0.9816709 -0.01549909 # Upper-Mid
7  0.8859904 -1.46126667 # Mid-Low

df$Category <- factor(df$Category, levels = 1:7, labels = c("Low-High","Mid-Upper","High-Low","Low-Low",...))

You will need to inspect the cluster means on your own machine (with the seed set) to label them appropriately. This approach also gives you groupings derived from your data rather than from an arbitrary threshold that you assume is right for it.
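As a rough end-to-end sketch of the labelling step: cluster numbers from kmeans are arbitrary, so it seems safest to keep a single fit and read both the assignments and the centers from it. The labels below are placeholders (hypothetical names, not taken from the answer), and iter.max = 50 is an assumption added only to give the algorithm more room to converge:

```r
set.seed(8)
df <- data.frame(n = rnorm(5000, 1), m = rnorm(5000, 0))

# Fit once and keep the object, so cluster numbers and centers correspond
fit <- kmeans(df[c("n", "m")], centers = 7, iter.max = 50)

# Inspect the centers to decide on labels (row i describes cluster i)
print(fit$centers)

# Placeholder labels (hypothetical); replace with names chosen from the centers
labs <- paste0("cluster_", seq_len(nrow(fit$centers)))

# levels = the cluster numbers, labels = the names to display
df$Category <- factor(fit$cluster, levels = seq_along(labs), labels = labs)
table(df$Category)
```

Passing the names through labels rather than levels matters here: the cluster vector contains the integers 1 to 7, so those are the levels, and the descriptive names are attached on top of them.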

Upvotes: 0

Sven Hohenstein

Reputation: 81683

Here's a solution:

df$level1 <- c("low", "medium", "high")[rowMeans(sign(df[c("n", "m")] - 1)) + 2]

df$level2 <- c("high", "low A/B")[(df$n < 1 & df$l %in% c("A", "B")) + 1]

#           n          m l level1  level2
# 1 0.9154139 -0.1078814 A    low low A/B
# 2 1.8404001 -0.1702891 B medium    high
# 3 0.5365172 -1.0883317 C    low    high
# 4 0.4491650 -3.0110517 D    low    high
# 5 1.7360404 -0.5931743 E medium    high
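The level1 line packs several steps into one expression. Broken apart, it might look like the sketch below, which assumes the same five rows as printed above (hard-coded here) and uses intermediate names (centered, signs, score) that are only illustrative:

```r
# Values copied from the printed data frame above (assumption: hard-coded)
df <- data.frame(
  n = c(0.9154139, 1.8404001, 0.5365172, 0.4491650, 1.7360404),
  m = c(-0.1078814, -0.1702891, -1.0883317, -3.0110517, -0.5931743),
  l = factor(LETTERS[1:5])
)

centered <- df[c("n", "m")] - 1   # negative below the threshold of 1
signs <- sign(centered)           # -1 below the threshold, +1 above it
score <- rowMeans(signs)          # -1: both below, 0: mixed, 1: both above

# Shift the scores -1/0/1 to the indices 1/2/3
level1 <- c("low", "medium", "high")[score + 2]
level1
```

One caveat: a value exactly equal to 1 has sign 0, which gives a fractional score; R truncates fractional indices, so such rows would need separate handling if they can occur.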

Upvotes: 3

branch14

Reputation: 1262

I'm probably missing the question, but when I add a missing closing parenthesis, it seems to work just fine:

> df$level <- ifelse(df$n < 1 & df$m < 1, "low", ifelse(df$n > 1 & df$m > 1, "high", "medium"))
> df
          n          m l  level
1 0.9154139 -0.1078814 A    low
2 1.8404001 -0.1702891 B medium
3 0.5365172 -1.0883317 C    low
4 0.4491650 -3.0110517 D    low
5 1.7360404 -0.5931743 E medium
> df$level
[1] "low"    "medium" "low"    "low"    "medium"
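The second pseudo-code line from the question can probably be made to work in the same way, with one change: == compares element by element and recycles the shorter vector, whereas %in% tests set membership. A small sketch, assuming the data frame values printed above (hard-coded here); the vector x is a hypothetical example added only to show the recycling behaviour:

```r
# Values copied from the printed data frame above (assumption: hard-coded)
df <- data.frame(
  n = c(0.9154139, 1.8404001, 0.5365172, 0.4491650, 1.7360404),
  m = c(-0.1078814, -0.1702891, -1.0883317, -3.0110517, -0.5931743),
  l = factor(LETTERS[1:5])
)

# %in% asks "is each label one of A or B?"
df$level <- ifelse(df$n < 1 & df$l %in% c("A", "B"), "low A/B", "high")
df$level

# Hypothetical vector showing why == is the wrong tool here:
# c("A", "B") is recycled to A, B, A, B and compared position by position
x <- c("B", "A", "B", "A")
recycled <- x == c("A", "B")    # every position mismatches
member <- x %in% c("A", "B")    # every element is a member
```

In the five-row example, == happens to line up with the A and B rows, which can hide the problem.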

Upvotes: 1
