reduce row to unique items

Question

I have the data frame

test <- structure(list(
     y2002 = c("freshman","freshman","freshman","sophomore","sophomore","senior"),
     y2003 = c("freshman","junior","junior","sophomore","sophomore","senior"),
     y2004 = c("junior","sophomore","sophomore","senior","senior",NA),
     y2005 = c("senior","senior","senior",NA, NA, NA)), 
              .Names = c("2002","2003","2004","2005"),
              row.names = c(c(1:6)),
              class = "data.frame")
> test
       2002      2003      2004   2005
1  freshman  freshman    junior senior
2  freshman    junior sophomore senior
3  freshman    junior sophomore senior
4 sophomore sophomore    senior   
5 sophomore sophomore    senior   
6    senior    senior

I would like to reshape the data so that each row keeps only its distinct steps, in order, as in

result <- structure(list(
 y2002 = c("freshman","freshman","freshman","sophomore","sophomore","senior"),
 y2003 = c("junior","junior","junior","senior","senior",NA),
 y2004 = c("senior","sophomore","sophomore",NA,NA,NA),
 y2005 = c(NA,"senior","senior",NA, NA, NA)), 
               .Names = c("1","2","3","4"),
               row.names = c(c(1:6)),
               class = "data.frame")

> result
          1      2         3      4
1  freshman junior    senior   
2  freshman junior sophomore senior
3  freshman junior sophomore senior
4 sophomore senior         
5 sophomore senior         
6    senior

I know that if I treated each row as a vector, I could do something like

careerrow <- c(1,2,3,3,4)
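# iterate over positions (not values) and pair each element with its successor
pairz <- lapply(seq_along(careerrow),function(i){c(careerrow[i],careerrow[i+1])})
# keep an element if it differs from its successor; the last one has no successor (NA)
uniquepairz <- careerrow[sapply(pairz,function(x){is.na(x[2]) || x[1]!=x[2]})]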

My difficulty is applying that row-wise to my data frame. I assume lapply is the way to go, but so far I have been unable to solve this.

mnel · Accepted Answer

If your aim is to count how many rows follow each pathway, you could use something like this. It uses data.table because of the nice way it handles lists as elements within a data.table (data.frame-like) object.

I am using !duplicated(...) to remove the duplicates as this is slightly more efficient than unique.
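As a quick aside on that choice, here is a minimal sketch showing that the two forms select the same values. The vector `steps` is a hypothetical stand-in for one row of `test`; it is not part of the original answer.

```r
# Hypothetical example row, modelled on the first row of `test`
steps <- c("freshman", "freshman", "junior", "senior")

# Both forms keep the first occurrence of each value, in original order
via_duplicated <- steps[!duplicated(steps)]
via_unique <- unique(steps)

# Compare the two results
same <- identical(via_duplicated, via_unique)
```

The code in the answer applies the same `!duplicated(...)` filter once per `id`.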

library(data.table)
library(reshape2)
# make the rownames a column 
test$id <- rownames(test)
# put in long format
DT <- as.data.table(melt(test, id.vars = 'id'))
# get the unique steps and concatenate into a unique identifier for each pathway
DL <- DT[!is.na(value), {.steps <- value[!duplicated(value)]
  stepid <- paste(.steps, collapse = '.')
  list(steps = list(.steps), stepid = stepid)}, by = id]
##    id                            steps                           stepid
## 1:  1           freshman,junior,senior           freshman.junior.senior
## 2:  2 freshman,junior,sophomore,senior freshman.junior.sophomore.senior
## 3:  3 freshman,junior,sophomore,senior freshman.junior.sophomore.senior
## 4:  4                 sophomore,senior                 sophomore.senior
## 5:  5                 sophomore,senior                 sophomore.senior
## 6:  6                           senior                           senior

# count the number per path

DL[, .N, by = stepid]
##                              stepid N
## 1:           freshman.junior.senior 1
## 2: freshman.junior.sophomore.senior 2
## 3:                 sophomore.senior 2
## 4:                           senior 1
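If the wide layout from the question is what you need instead of counts, a base R approach along the following lines might work. This is only a sketch under some assumptions: `unique_steps` is a hypothetical helper name, the data frame is rebuilt here so the snippet runs on its own, and all repeats within a row are dropped (not only consecutive ones), which matches the `!duplicated(...)` logic above.

```r
# Rebuild the question's data frame so the sketch is self-contained
test <- data.frame(
  `2002` = c("freshman", "freshman", "freshman", "sophomore", "sophomore", "senior"),
  `2003` = c("freshman", "junior", "junior", "sophomore", "sophomore", "senior"),
  `2004` = c("junior", "sophomore", "sophomore", "senior", "senior", NA),
  `2005` = c("senior", "senior", "senior", NA, NA, NA),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Hypothetical helper: drop NAs and repeated values, then pad with NA
# so that every row comes back with the same length
unique_steps <- function(x, width) {
  steps <- unname(x[!is.na(x) & !duplicated(x)])
  c(steps, rep(NA_character_, width - length(steps)))
}

# apply() returns one column per input row, so transpose the result
wide <- t(apply(test, 1, unique_steps, width = ncol(test)))
result <- as.data.frame(wide, stringsAsFactors = FALSE)
names(result) <- as.character(seq_len(ncol(result)))
```

Padding to a fixed width is what lets `apply()` return a matrix rather than a list. If only consecutive repeats should be collapsed, `rle(x[!is.na(x)])$values` could stand in for the `duplicated` filter inside the helper.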
