rnso

Reputation: 24545

Find category from reference values stored in columns in R

I have the following data:

> dput(mydata)
structure(list(P3 = c(99.4, 105.8, 111.9), P5 = c(100.4, 106.9, 
113.1), P10 = c(102, 108.6, 114.9), P25 = c(104.8, 111.6, 118.1
), P50 = c(108, 115, 121.8), P75 = c(111.2, 118.6, 125.6), P90 = c(114.3, 
121.9, 129.1), P95 = c(116.1, 123.9, 131.3), P97 = c(117.4, 125.3, 
132.7), val = c(115.5, 112.7, 117)), .Names = c("P3", "P5", "P10", 
"P25", "P50", "P75", "P90", "P95", "P97", "val"), row.names = 7:9, class = "data.frame")
> 
> mydata
     P3    P5   P10   P25   P50   P75   P90   P95   P97   val
7  99.4 100.4 102.0 104.8 108.0 111.2 114.3 116.1 117.4 115.5
8 105.8 106.9 108.6 111.6 115.0 118.6 121.9 123.9 125.3 112.7
9 111.9 113.1 114.9 118.1 121.8 125.6 129.1 131.3 132.7 117.0

I want to create a new column 'categ' in mydata. For each row, it should hold the numeric part of the name of the first column (checked from left to right) whose value is larger than that row's 'val'.

Hence, I should get 95, 50, 25 in the new column.

I know that the 'findInterval' and 'match' functions are used for this kind of classification, but I have not been able to apply them to mydata. Thanks for your help.

Upvotes: 3

Views: 76

Answers (2)

Carl Witthoft

Reputation: 21502

To answer the post-question about speed:

bigdat <- mydata
for (j in 1:10) bigdat <- rbind(bigdat, bigdat)  # doubles 10 times: 3 * 2^10 = 3072 rows

frist <- function(mydata) {
    indx <- max.col(mydata[, -10] > mydata$val, 'first')
    mydata$categ <- as.numeric(sub("[A-Z]+", "", names(mydata)[indx]))
}

sceond <- function(mydata) indx <- apply(mydata[, -10] > mydata$val, 1, function(x) names(which(x))[1])

library(microbenchmark)
microbenchmark(frist(bigdat), sceond(bigdat))

Unit: milliseconds
           expr       min        lq    median        uq      max neval
  frist(bigdat)  5.400829  5.688074  7.166702  7.816168 142.6927   100
 sceond(bigdat) 22.333659 24.442536 25.422791 26.984677 178.7408   100

EDIT: per akrun's comment, I added the same regex step to the sceond function, but it doesn't affect the timing:

sceond <- function(mydata) {
    indx <- apply(mydata[, -10] > mydata$val, 1, function(x) names(which(x))[1])
    # indx already holds column names here, so strip the prefix from it directly
    mydata$categ <- as.numeric(sub("[A-Z]+", "", indx))
}
Unit: milliseconds
           expr       min        lq    median        uq       max neval
  frist(bigdat)  5.315901  5.613826  6.940932  7.791208  29.15699   100
 sceond(bigdat) 22.359897 24.588688 25.636795 27.868710 359.79325   100
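
A timing comparison only means something if both approaches return the same answer, so a quick sanity check may be worth running first. The following is a minimal sketch: via_max_col and via_apply are hypothetical names introduced here for illustration, and both return the numeric category vector instead of modifying the data frame.

```r
# Same three rows as in the question
mydata <- data.frame(
  P3 = c(99.4, 105.8, 111.9), P5 = c(100.4, 106.9, 113.1),
  P10 = c(102, 108.6, 114.9), P25 = c(104.8, 111.6, 118.1),
  P50 = c(108, 115, 121.8), P75 = c(111.2, 118.6, 125.6),
  P90 = c(114.3, 121.9, 129.1), P95 = c(116.1, 123.9, 131.3),
  P97 = c(117.4, 125.3, 132.7), val = c(115.5, 112.7, 117),
  row.names = 7:9
)

bigdat <- mydata
for (j in 1:10) bigdat <- rbind(bigdat, bigdat)  # 3 * 2^10 = 3072 rows

# Hypothetical helper: vectorised lookup with max.col
via_max_col <- function(d) {
  indx <- max.col(d[, -10] > d$val, "first")
  as.numeric(sub("[A-Z]+", "", names(d)[indx]))
}

# Hypothetical helper: row-by-row lookup with apply
via_apply <- function(d) {
  nm <- apply(d[, -10] > d$val, 1, function(x) names(which(x))[1])
  as.numeric(sub("[A-Z]+", "", nm))
}

# Stops with an error if the two approaches ever disagree
stopifnot(identical(via_max_col(bigdat), via_apply(bigdat)))
```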

Upvotes: 1

akrun

Reputation: 887158

You could try

indx <- max.col(mydata[,-10] >mydata$val,'first')
mydata$categ <- as.numeric(sub("[A-Z]+", "", names(mydata)[indx]))
mydata$categ
#[1] 95 50 25
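
One caveat that may matter on real data: if 'val' is above every reference column, the comparison row is all FALSE, and max.col with 'first' still returns column 1, which would wrongly report 3. The sketch below wraps the same steps in a function and sets such rows to NA. The name first_above and its value_col argument are assumptions made for illustration, not part of the original answer.

```r
# Same three rows as in the question
mydata <- data.frame(
  P3 = c(99.4, 105.8, 111.9), P5 = c(100.4, 106.9, 113.1),
  P10 = c(102, 108.6, 114.9), P25 = c(104.8, 111.6, 118.1),
  P50 = c(108, 115, 121.8), P75 = c(111.2, 118.6, 125.6),
  P90 = c(114.3, 121.9, 129.1), P95 = c(116.1, 123.9, 131.3),
  P97 = c(117.4, 125.3, 132.7), val = c(115.5, 112.7, 117),
  row.names = 7:9
)

# Hypothetical helper: numeric part of the first column whose value exceeds value_col
first_above <- function(df, value_col = "val") {
  ref <- df[, names(df) != value_col, drop = FALSE]  # reference columns only
  above <- as.matrix(ref > df[[value_col]])          # TRUE where reference > val
  indx <- max.col(above, ties.method = "first")      # first TRUE in each row
  indx[rowSums(above) == 0] <- NA                    # no column exceeds val
  as.numeric(sub("[A-Z]+", "", names(ref)[indx]))
}

mydata$categ <- first_above(mydata)

# A row whose val is above every reference value gets NA instead of 3
extreme <- mydata[, names(mydata) != "categ"]
extreme$val[1] <- 200
extreme_categ <- first_above(extreme)
```

Selecting the reference columns by name instead of by position (-10) is a design choice here, so that the function keeps working after 'categ' has been added to the data frame.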

Or

indx <- apply(mydata[,-10] > mydata$val, 1, function(x) names(which(x))[1])

and then use sub as before to strip the letter prefix and convert the result to numeric.
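
Spelled out in full, the apply variant might look like the sketch below. Note that apply already returns column names here, so sub is applied to them directly and no lookup through names(mydata) is needed. The variable names above and first_name are introduced for illustration.

```r
# Same three rows as in the question
mydata <- data.frame(
  P3 = c(99.4, 105.8, 111.9), P5 = c(100.4, 106.9, 113.1),
  P10 = c(102, 108.6, 114.9), P25 = c(104.8, 111.6, 118.1),
  P50 = c(108, 115, 121.8), P75 = c(111.2, 118.6, 125.6),
  P90 = c(114.3, 121.9, 129.1), P95 = c(116.1, 123.9, 131.3),
  P97 = c(117.4, 125.3, 132.7), val = c(115.5, 112.7, 117),
  row.names = 7:9
)

# Logical matrix: TRUE where a reference column exceeds that row's val
above <- mydata[, -10] > mydata$val

# For each row, the name of the first TRUE column (NA if there is none)
first_name <- apply(above, 1, function(x) names(which(x))[1])

# Drop the leading letters and convert what is left to a number
mydata$categ <- as.numeric(sub("[A-Z]+", "", first_name))
```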

data

mydata <- structure(list(P3 = c(99.4, 105.8, 111.9), P5 = c(100.4, 106.9, 
113.1), P10 = c(102, 108.6, 114.9), P25 = c(104.8, 111.6, 118.1
), P50 = c(108, 115, 121.8), P75 = c(111.2, 118.6, 125.6), P90 = c(114.3, 
121.9, 129.1), P95 = c(116.1, 123.9, 131.3), P97 = c(117.4, 125.3, 
132.7), val = c(115.5, 112.7, 117)), .Names = c("P3", "P5", "P10", 
"P25", "P50", "P75", "P90", "P95", "P97", "val"), class = "data.frame",
row.names = c("7", "8", "9"))

Upvotes: 3
