Evan

Reputation: 1499

How to calculate a proportion of columns meeting a threshold in R?

I have data in R in a numeric class in the form:

Input_SNP Set_1 Set_2 Set_3 Set_4 Set_5 Set_5
10.67     7.91  6.98  7.93  7.70  11.15 8.58

I actually have 500 sets. I would like to calculate the proportion of sets whose value is greater than or equal to the value in my Input_SNP column. For example, this row has 1 value (11.15) greater than or equal to 10.67, so I would like 1/(number of sets). I'm sure this is simple; how can it be done?

Upvotes: 0

Views: 830

Answers (2)

user5363218


data = read.table(header = T,  text  = "Input_SNP     Set_1     Set_2     Set_3     Set_4     Set_5      Set_5
10.67          7.91      6.98      7.93      7.70      11.15      8.58")

# Compare all the values (except the first) to the first.
# The question asks for "greater than or equal to", so use >= rather than >
data[,-1] >= data$Input_SNP
# Set_1 Set_2 Set_3 Set_4 Set_5 Set_5.1
# [1,] FALSE FALSE FALSE FALSE  TRUE   FALSE


# Count the TRUE values and divide by the number of Set columns
length(which(data[,-1] >= data$Input_SNP)) / (ncol(data) - 1)
# 0.1666667
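
The mean of a logical vector is the proportion of TRUE values, so the single-row case can also be sketched without counting indices. This is only an illustration: the names `dat`, `sets` and `prop_ge` are made up here, and the last column is called Set_6 to avoid the duplicated Set_5.

```r
# Hypothetical one-row data mirroring the question (names are illustrative)
dat <- data.frame(Input_SNP = 10.67, Set_1 = 7.91, Set_2 = 6.98, Set_3 = 7.93,
                  Set_4 = 7.70, Set_5 = 11.15, Set_6 = 8.58)

# Flatten the Set columns of the row into a plain numeric vector
sets <- unlist(dat[1, -1])

# mean() of a logical vector gives the share of TRUE values
prop_ge <- mean(sets >= dat$Input_SNP[1])
prop_ge
```

Here one of the six sets (11.15) meets the threshold, so the result is 1/6.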

If you don't want to use data frames, the following uses a matrix:

data = read.table(header = T,  text  = "Input_SNP     Set_1     Set_2     Set_3         Set_4     Set_5      Set_5
10.67          7.91      6.98      7.93      7.70      11.15      8.58")

# Generate some further random data to verify correct row indexing 
data = rbind(data, runif(n = ncol(data), min = 5, max = 15))
data = as.matrix(data)

# Input_SNP    Set_1    Set_2    Set_3    Set_4     Set_5 Set_5.1
# 1 10.670000 7.910000  6.98000  7.93000 7.700000 11.150000  8.5800
# 2  6.670087 5.308156 12.81796 13.40233 7.753867  5.049444 14.5793



logicalResults = apply(X = data, MARGIN = 1, FUN = function(x){x[1] <= x[-1]})
logicalResults = t(logicalResults)

#   Set_1 Set_2 Set_3 Set_4 Set_5 Set_5.1
# 1 FALSE FALSE FALSE FALSE  TRUE   FALSE
# 2 FALSE  TRUE  TRUE  TRUE FALSE    TRUE


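# Count the TRUE values per row and divide by the number of Set columns.
# logicalResults no longer contains Input_SNP, so no column is dropped here.
apply(X = logicalResults, MARGIN = 1, FUN = sum) / ncol(logicalResults)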
# 1         2 
# 0.1666667 0.6666667 

Upvotes: 1

Pierre L

Reputation: 28461

Whether it is a data frame or a matrix, you can try:

rowMeans(df[,-1] >= df[,1], na.rm=TRUE)
#[1] 0.1666667

Or, if we extend the data using the data from your last question, it still works (>= matches the "greater than or equal to" in the question):

rowMeans(df[,-1] >= df[,1], na.rm=TRUE)
#[1] 0.4000000 1.0000000       NaN 0.0000000 0.4000000 0.2000000 0.1666667

And also to make sure it works for matrices:

mat <- as.matrix(df)
rowMeans(mat[,-1] >= mat[,1], na.rm=TRUE)
#[1] 0.4000000 1.0000000       NaN 0.0000000 0.4000000 0.2000000 0.1666667
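
One thing to be aware of with na.rm=TRUE: the denominator becomes the number of non-missing comparisons in each row, not the total number of sets, and a row whose Input_SNP is NA gives NaN. If you want 1/(number of sets) as in the question, a possible variant is to divide the row sums by the column count. The sketch below uses made-up values, and the names `m`, `hits`, `prop_non_missing` and `prop_all_sets` are only for illustration.

```r
# Small made-up example: one NA among the sets, one NA reference value
m <- rbind(c(1.0, 0.5, NA,  2.0, 1.0),
           c(NA,  1.0, 2.0, 3.0, 4.0))
colnames(m) <- c("Input_SNP", "Set_1", "Set_2", "Set_3", "Set_4")

# drop = FALSE keeps a matrix even if only one Set column remains
hits <- m[, -1, drop = FALSE] >= m[, 1]

# Denominator = number of non-NA comparisons in each row
prop_non_missing <- rowMeans(hits, na.rm = TRUE)

# Denominator = total number of sets, so an NA comparison counts as "not >="
prop_all_sets <- rowSums(hits, na.rm = TRUE) / ncol(hits)

prop_non_missing
prop_all_sets
```

In the first row the two versions differ (2/3 versus 2/4). In the second row the first version returns NaN and the second returns 0; which of those is appropriate depends on how you want to treat a missing Input_SNP.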

Extended data (define df with this before running the calls above):

df <- read.table(text="Input_SNP   Set_1    Set_2     Set_3     Set_4     Set_5     Set_6
1.09        0.162    NA        2.312     1.876     0.12      0.812
0.687       NA       0.987     1.32      1.11      1.04      NA
NA          1.890    0.923     1.43      0.900     2.02      2.7
2.801       0.642    0.791     0.812     NA        0.31      1.60
1.33        1.33     NA        1.22      0.23      0.18      1.77
2.91        1.00     1.651     NA        1.55      3.20      0.99
2.00        2.31     0.89      1.13      1.25      0.12      1.55", header=T)

Update

If you are comparing the data frame to a separate numeric vector, you do not need to index the vector by column, because a vector has no dimensions:

rowMeans(df[-1] >= my_vector, na.rm=TRUE)
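
This relies on the vector having one entry per row: it is recycled down each column, so row i is compared with the i-th element. A minimal sketch with made-up numbers follows; `sets`, `my_vector` and `prop` are illustrative names, and the length check is an extra precaution rather than something the call itself requires.

```r
# Made-up data: three rows of sets and one reference value per row
sets <- data.frame(Set_1 = c(1, 5, 3),
                   Set_2 = c(4, 2, 3),
                   Set_3 = c(7, 8, 0))
my_vector <- c(4, 9, 0)

# Guard against silent recycling of a vector with the wrong length
stopifnot(length(my_vector) == nrow(sets))

# Row i of `sets` is compared with my_vector[i]
prop <- rowMeans(sets >= my_vector, na.rm = TRUE)
prop
```

With these values the three rows give 2/3, 0 and 1 respectively.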

Upvotes: 1
