Reputation: 6649
I have a data frame of values, and for each value I want to determine whether it is within, say, 10% of any other value in its row. I want to do this generically, as I do not know how many columns I will have or what they will be called.
Some values are NA. If all other values in the row are NA, I want to return TRUE for the one value that is present. For the values that are themselves NA, I want to return FALSE. The values are all non-negative (they can be 0).
For example, say I have the following data frame:
dataDF <- data.frame(
  a = c(100, 250,  NA, 700, 0),
  b = c(105, 300, 280,  NA, 0),
  c = c(200, 400, 280,  NA, 0)
)
In the first row we have a = 100, b = 105 and c = 200. a and b are within 10% of each other, so both are TRUE; c is not within 10% of either a or b, so it is FALSE.
In the second row no values are within 10% of each other, so all are FALSE.
In the third row b and c are equal, so both are TRUE; a is NA, so it is FALSE.
In the fourth row only a has a value, so it is TRUE; b and c are NA, so they are FALSE.
In the final row all values are the same, so all are TRUE.
So my output would be:
data.frame(
  a = c( TRUE, FALSE, FALSE,  TRUE, TRUE),
  b = c( TRUE, FALSE,  TRUE, FALSE, TRUE),
  c = c(FALSE, FALSE,  TRUE, FALSE, TRUE)
)
How I calculate the percentage difference doesn't really matter, but the way I was going to do it is to divide the absolute difference by the average of the two values, so that I get the same result whichever value I start from.
So for example to calculate the percentage difference between 100 and 105 it would be:
abs(100 - 105)/((100 + 105)/2) = 5/102.5 = 0.0488
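As a minimal sketch of that calculation, the symmetric difference could be wrapped in a small helper. The name `pct_diff` is hypothetical, and returning 0 when both values are 0 is an assumption made here so that two zeros count as identical rather than producing NaN from 0/0.

```r
# Hypothetical helper: symmetric relative difference between x and y,
# i.e. |x - y| divided by the average of x and y.
# Assumes non-negative inputs, as stated in the question.
pct_diff <- function(x, y) {
  avg <- (x + y) / 2
  # Both values 0 gives an average of 0; treat that pair as identical (assumption)
  ifelse(avg == 0, 0, abs(x - y) / avg)
}

pct_diff(100, 105)  # 5 / 102.5, about 0.0488
pct_diff(105, 100)  # same value: the order of the arguments does not matter
pct_diff(0, 0)      # 0 rather than NaN
```

An NA in either argument propagates to an NA result, so the caller still has to decide how to treat missing values.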
Any ideas on the quickest and neatest way of doing this would be appreciated.
Thanks
Upvotes: 2
Views: 1745
Reputation: 31181
Define a function and apply it to each row of your data.frame:
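fun <- function(vec)
{
  n <- length(vec)
  # Every value in the row is NA: nothing can match
  if (all(is.na(vec)))
    return(rep(FALSE, n))
  # Only one distinct non-NA value (a lone value, or all equal, e.g. all 0):
  # TRUE wherever the value is not NA
  noNA <- vec[!is.na(vec)]
  if (length(unique(noNA)) == 1)
    return(!is.na(vec))
  res <- rep(FALSE, n)
  for (i in seq_len(n)) {
    # TRUE if vec[i] differs from any other value by at most 10% of that value
    if (any(abs(vec[i] - vec[-i]) <= vec[-i] * 0.1, na.rm = TRUE))
      res[i] <- TRUE
  }
  res
}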
output=data.frame(t(apply(dataDF,1,fun)))
names(output) = names(dataDF)
output
This gives the desired result:
#      a     b     c
#1  TRUE  TRUE FALSE
#2 FALSE FALSE FALSE
#3 FALSE  TRUE  TRUE
#4  TRUE FALSE FALSE
#5  TRUE  TRUE  TRUE
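One caveat: the loop above measures the difference against 10% of the other value, so the test is not symmetric. For 100 and 111, for instance, 100 passes (11 <= 11.1) but 111 does not (11 <= 10). If the symmetric measure from the question is preferred, a possible variant is sketched below. The name `within_tol` and the `tol` argument are hypothetical, and non-negative values are assumed, as in the question.

```r
dataDF <- data.frame(
  a = c(100, 250,  NA, 700, 0),
  b = c(105, 300, 280,  NA, 0),
  c = c(200, 400, 280,  NA, 0)
)

# Hypothetical variant: |x - y| <= tol * mean(x, y) for every pair in the row
within_tol <- function(vec, tol = 0.1) {
  ok <- !is.na(vec)
  # A lone non-NA value counts as TRUE, as in the fourth row of the example
  if (sum(ok) == 1) return(ok)
  d   <- abs(outer(vec, vec, "-"))  # pairwise absolute differences
  avg <- outer(vec, vec, "+") / 2   # pairwise averages
  close <- d <= tol * avg           # comparing, not dividing, so 0 vs 0 is safe
  diag(close) <- FALSE              # a value must match a different column
  # NA comparisons are dropped, so NA entries end up FALSE
  rowSums(close, na.rm = TRUE) > 0
}

output2 <- as.data.frame(t(apply(dataDF, 1, within_tol)))
names(output2) <- names(dataDF)
output2
```

On the example data this produces the same TRUE/FALSE pattern as the result shown above. The two versions only differ for pairs near the 10% boundary, such as 100 and 111, where this variant returns FALSE for both values.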
Upvotes: 2