Reputation: 323
I have a data frame that contains numerical values 1:4 with some NA's. For each row, I would like to calculate the frequency (as a percentage) of the value with the fewest occurrences greater than 0.
Here is a sample data frame to work with.
df = as.data.frame(rbind(c(1,2,1,2,2,2,2,1,NA,2),c(2,3,3,2,3,3,NA,2,NA,NA),c(4,1,NA,NA,NA,1,1,1,4,4),c(3,3,3,4,4,4,NA,4,3,4)))
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 1 2 1 2 2 2 2 1 NA 2
2 2 3 3 2 3 3 NA 2 NA NA
3 4 1 NA NA NA 1 1 1 4 4
4 3 3 3 4 4 4 NA 4 3 4
I have 2 points that I am struggling with. 1) finding the lowest frequency of a value greater than 0 and 2)applying the function to each row of my data frame. When I started working on this function I implemented it using the code below, but it did not appear to be applied to every row. My result for value.1, value.2, etc was the same for every row.
Low_Freq = function(x){
value.1 = sum(x==1, na.rm=TRUE) #count the number of 1's per row
value.2 = sum(x==2, na.rm=TRUE) #count the number of 2's per row
value.3 = sum(x==3, na.rm=TRUE) #count the number of 3's per row
value.4 = sum(x==4, na.rm=TRUE) #count the number of 4's per row
num.values = rowSums(!is.na(x), na.rm=TRUE) #count total number of non-NA values in each row
#what is the minimum frequency value greater than 0 among value.1, value.2, value.3, and value.4 for EACH row?
min.value.freq = min(cbind(value.1,value.2,value.3,value.4))
out = min.value.freq/num.values #calculate the percentage of the minimum value for each row
}
df$Low_Freq = apply(df, 1, function(x))
Then I started using rowSums() to compute value.1, value.2, value.3, and value.4. This fixed my problem of counting value.1, value.2, etc for each row, however, I then had to apply the function without the use of apply() for it to run:
Low_Freq = function(x){
value.1 = rowSums(x==1, na.rm=TRUE) #count the number of 1's per row
value.2 = rowSums(x==2, na.rm=TRUE) #count the number of 2's per row
value.3 = rowSums(x==3, na.rm=TRUE) #count the number of 3's per row
value.4 = rowSums(x==4, na.rm=TRUE) #count the number of 4's per row
num.values = rowSums(!is.na(x), na.rm=TRUE) #count total number of non-NA values in each row
#what is the minimum frequency value greater than 0 among value.1, value.2, value.3, and value.4 for EACH row?
min.value.freq = min(cbind(value.1,value.2,value.3,value.4))
out = min.value.freq/num.values #calculate the percentage of the minimum value for each row
}
df$Low_Freq = Low_Freq(df)
So the act of applying to each row then seemed to occur within the function itself. That's all fine and dandy, but when I go to make my final calculation which will be my output, I cannot figure out how to identify which of values 1, 2, 3, or 4 has the lowest frequency for each row. This value must be divided by the number of non-NA values for each row.
My desired result should look like this:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 Low_Freq
1 1 2 1 2 2 2 2 1 NA 2 0.3333333
2 2 3 3 2 3 3 NA 2 NA NA 0.4285714
3 4 1 NA NA NA 1 1 1 4 4 0.4285714
4 3 3 3 4 4 4 NA 4 3 4 0.4444444
I feel like I am going in circles with this seemingly simple function. Any help would be appreciated.
Thank you.
Upvotes: 2
Views: 1497
Reputation: 44340
The table
function will return the frequency of each value that appears, ignoring NA
values. Therefore, the min
of the table
result is the minimum frequency of a value that shows up in your row, and the sum is the number of non-NA
values in your row.
Low_Freq = function(x){
tab = table(x)
return(min(tab) / sum(tab))
}
df$Low_Freq = apply(df, 1, Low_Freq)
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 Low_Freq
# 1 1 2 1 2 2 2 2 1 NA 2 0.3333333
# 2 2 3 3 2 3 3 NA 2 NA NA 0.4285714
# 3 4 1 NA NA NA 1 1 1 4 4 0.4285714
# 4 3 3 3 4 4 4 NA 4 3 4 0.4444444
If you wanted to not use 5s for the numerator but to use them for the denominator, you could do:
df = as.data.frame(rbind(c(1,2,1,2,2,2,2,1,NA,2),c(2,3,3,2,3,3,NA,2,NA,NA),c(4,1,NA,NA,NA,1,1,1,4,4),c(3,3,3,4,4,4,5,4,3,4)))
Low_Freq = function(x){
tab = table(x[x != 5])
return(min(tab) / sum(!is.na(x)))
}
df$Low_Freq = apply(df, 1, Low_Freq)
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 Low_Freq
# 1 1 2 1 2 2 2 2 1 NA 2 0.3333333
# 2 2 3 3 2 3 3 NA 2 NA NA 0.4285714
# 3 4 1 NA NA NA 1 1 1 4 4 0.4285714
# 4 3 3 3 4 4 4 5 4 3 4 0.4000000
Upvotes: 3