D.Singleton

Reputation: 179

Variable length formula construction

I am trying to apply Simpson's Diversity Index across a number of datasets, each with a variable number of species ('nuse') captured. I would therefore like code that copes with this automatically, without my having to construct the formula manually each time. An example dataset and a manual formula are below:

diverse <- data.frame(nuse1=c(0,20,40,20), nuse2=c(5,5,3,20), nuse3=c(0,2,8,20), nuse4=c(5,8,2,20), total=c(10,35,53,80))

simp <- function(x) { 
    total <- x[,"total"]
    nuse1 <- x[,"nuse1"]
    nuse2 <- x[,"nuse2"]
    nuse3 <- x[,"nuse3"]
    nuse4 <- x[,"nuse4"]

    div <- round(1 - (nuse1*(nuse1 - 1) + nuse2*(nuse2 - 1) + nuse3*(nuse3 - 1) + nuse4*(nuse4 - 1)) / (total*(total - 1)), digits = 4)
    return(div)
}

diverse$Simpson <- simp(diverse)
diverse

As you can see, this works fine. However, how could I write a function that automatically adjusts to, for example, 9 species (so up to nuse9)?

I have experimented with paste + as.formula, as suggested in "Formula with dynamic number of variables"; however, it is the expanded form of (nuse1 * (nuse1 - 1)) that I'm struggling with. Does anyone have any suggestions, please? Thanks.

Upvotes: 0

Views: 57

Answers (1)

Evan Friedland

Reputation: 3194

How about something like:

diverse <- data.frame(nuse1=c(0,20,40,20), nuse2=c(5,5,3,20), nuse3=c(0,2,8,20), nuse4=c(5,8,2,20), total=c(10,35,53,80))

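simp <- function(x, species) {
  # columns whose names start with the species prefix, e.g. "nuse"
  spcs <- grep(paste0("^", species), colnames(x))
  # drop = FALSE keeps a data frame even if only one column matches
  counts <- x[, spcs, drop = FALSE]
  total <- rowSums(counts) # sum by row
  div <- round(1 - rowSums(apply(counts, 2, function(s) s*(s-1))) / (total*(total - 1)), digits = 4)
  return(div)
}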

diverse$Simpson2 <- simp(diverse, species = "nuse")
diverse

#   nuse1 nuse2 nuse3 nuse4 total Simpson2
# 1     0     5     0     5    10   0.5556
# 2    20     5     2     8    35   0.6151
# 3    40     3     8     2    53   0.4107
# 4    20    20    20    20    80   0.7595

All it does is find the columns whose names match the pattern you pass as species ("nuse" here, or whatever prefix your dataset uses). It computes the "total" value inside the function, so the dataset does not need a total column.
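If you would rather avoid regular expressions and apply(), here is a rough sketch of the same idea using a literal prefix match and plain matrix arithmetic. The name simp_prefix and its prefix and digits arguments are made up for illustration, and it assumes every column starting with the prefix holds a species count:

```r
diverse <- data.frame(nuse1=c(0,20,40,20), nuse2=c(5,5,3,20), nuse3=c(0,2,8,20), nuse4=c(5,8,2,20), total=c(10,35,53,80))

# simp_prefix is a hypothetical name, not part of the code above
simp_prefix <- function(x, prefix = "nuse", digits = 4) {
  # literal "starts with" test on the column names (no regex involved)
  is_species <- startsWith(colnames(x), prefix)
  # drop = FALSE + as.matrix keeps a 2-D shape even with a single species column
  counts <- as.matrix(x[, is_species, drop = FALSE])
  total <- rowSums(counts)
  # n * (n - 1) is computed elementwise over the whole matrix, then summed by row
  round(1 - rowSums(counts * (counts - 1)) / (total * (total - 1)), digits = digits)
}

simp_prefix(diverse)
# values: 0.5556 0.6151 0.4107 0.7595
```

This works unchanged for nuse1 to nuse9 or any other number of species columns. Note that a row whose counts sum to 0 or 1 gives NaN, because the denominator total*(total - 1) is zero.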

Upvotes: 1
