nasia jaffri
nasia jaffri

Reputation: 823

Create a new column based on other column names and values

I am trying to create a new column based on few other column names and their values in that data frame.

The data looks something like this

user_id  Gender Age  Cate_Ch_Bot  Cate_Ch_Phy  Cate_Ch_Chem  Cate_Ch_Comp  Cate_Ch_Zoo
0001      F     13      0              1           0              1              0
0002      M     17      1              1           0              0              0
0003      F     13      0              0           0              0              0
0004      F     12      0              0           1              0              0
0005      F     14      0              1           0              0              1
0006      M     16      0              0           0              0              0

I need to create a category column, which will have all the categories that have the value of 1. If the user does not have a category then it be blank or NA.

So the desired output will be:

user_id  Gender Age Cate_Ch_Bot Cate_Ch_Phy Cate_Ch_Chem Cate_Ch_Comp Cate_Ch_Zoo   Ch_Category
0001      F     13      0            1            0           1           0         Phy:Comp
0002      M     17      1            1            0           0           0         Bot:Phy
0003      F     13      0            0            0           0           0         NA
0004      F     12      0            0            1           0           0         Chem
0005      F     14      0            1            0           0           1         Phy:Zoo
0006      M     16      0            0            0           0           0         NA

I am trying to loop through the column names, but not sure how to do it correctly.

test$category = ""
for (j in 1:dim(test)[1]){
  for (i in colnames(test[4:14])){
    name = colnames(test[i])   
    if (test[j,name] == 1){
      test$category[j] = paste(test$category[j], colnames(test[i]),sep=":")
    }
  }
}

I would deeply appreciate any help in this regard.

Upvotes: 0

Views: 2346

Answers (1)

nrussell
nrussell

Reputation: 18612

How about something like this:

Df <- data.frame(
  user_id=1:6,
  Gender=rep(c("M","F"),3),
  Age=sample(13:17,6,replace=TRUE),
  Cate_Ch_Bot=c(0,1,rep(0,4)),
  Cate_Ch_Phy=c(1,1,0,0,1,0),
  Cate_Ch_Chem=c(0,0,0,1,0,0),
  Cate_Ch_Comp=c(1,0,0,0,0,0),
  Cate_Ch_Zoo=c(0,0,0,0,1,0),
  stringsAsFactors=FALSE)
##
Labs <- gsub("Cate_Ch_","",names(Df)[-c(1:3)])
##
getCols <- function(x)
{
  Reduce(function(x,y){paste0(x,":",y)},Labs[which(x==1)])
}
##
Df$new <- apply(Df[,-c(1:3)],1,function(X){
  if( is.null(getCols(X)) ){
    ""
  } else {
    getCols(X)
  }
})
##
> Df2
  user_id Gender Age Cate_Ch_Bot Cate_Ch_Phy Cate_Ch_Chem Cate_Ch_Comp Cate_Ch_Zoo      new
1       1      M  13           0           1            0            1           0 Phy:Comp
2       2      F  14           1           1            0            0           0  Bot:Phy
3       3      M  16           0           0            0            0           0         
4       4      F  14           0           0            1            0           0     Chem
5       5      M  14           0           1            0            0           1  Phy:Zoo
6       6      F  16           0           0            0            0           0         

Edited: I had to wrap the getCols with an if..else statement inside of the apply function because it was actually returning a list instead of a vector, with NULL elements for the rows of Df where none of the columns had a value of 1. Previously, it looked like a data.frame on the surface, but a closer inspection reveals this:

> class(Df)
[1] "data.frame"
> str(Df)
'data.frame':   6 obs. of  9 variables:
 $ user_id     : int  1 2 3 4 5 6
 $ Gender      : chr  "M" "F" "M" "F" ...
 $ Age         : int  13 14 16 14 14 16
 $ Cate_Ch_Bot : num  0 1 0 0 0 0
 $ Cate_Ch_Phy : num  1 1 0 0 1 0
 $ Cate_Ch_Chem: num  0 0 0 1 0 0
 $ Cate_Ch_Comp: num  1 0 0 0 0 0
 $ Cate_Ch_Zoo : num  0 0 0 0 1 0
 $ new         :List of 6
  ..$ : chr "Phy:Comp"
  ..$ : chr "Bot:Phy"
  ..$ : NULL
  ..$ : chr "Chem"
  ..$ : chr "Phy:Zoo"
  ..$ : NULL

which is undesirable. As for an explanation of what's going on in the solution,

  1. Labs <- gsub("Cate_Ch_","",names(Df)[-c(1:3)]) is just a convenience step, so that there is a vector of ready-made labels to refer to. gsub is taking the names of the target columns and replacing "Cate_Ch_" with an empty string "", so that the remaining text can be used as a label.
  2. The getCols function is structured to operate on a single vector x - in this case, a single row of Df. It uses Reduce to apply a sub-operation (pasting two strings, separated by a :) in a cumulative manner, where this sub-operation is given in terms of an anonymous function function(x,y){ ... }. The input we give to function(x,y) is a subset of our entire Labs vector - the subset begin only those elements in the row where x==1. Using which(x==1) just gives us the indices of the row equal to one. So for row 2 of Df, which(x==1) gives the vector c(1,2) (since Cate_Ch_Bot and Cate_Ch_Phy have a value of 1 in row 2). Evaluating this vector of indices inside of Labs gives you elements 1 and 2 of Labs - c("Bot","Phy"). When this is passed to our Reduce(function(x,y) ... call, it pastes together all of the elements, separated by a :, and returns a single character value "Bot:Phy". If the input to Reduce had been c("A","B","C","D"), it would have returned "A:B:C:D", and so on.
  3. After having defined a function to perform the desired operation on a single row, apply is used to perform the operation over multiple rows. As I noted above, I had to make a slight change to my original call to apply to make sure that it returned a vector and not a list.

And after all that typing... @Richard Scriven points out a better answer:

> apply(Df[-(1:3)] == 1, 1, function(x) {
  paste(gsub(".*_", "", names(which(x))), collapse = ":")
})
[1] "Phy:Comp" "Bot:Phy"  ""         "Chem"     "Phy:Zoo"  "" 

Upvotes: 3

Related Questions