Reputation: 64

Randomize data between two columns in R

I have searched for an answer or a solution to this task with no success as of yet, so I do apologize if this is redundant.

I want to randomize the data between two columns. This is to simulate species misidentification in vegetation field data, so I want to assign some sort of probability of misidentification between the two columns as well. I would imagine that there is some way to do this using sample or the "permute" package.

I will select some readily available data for an example.

library (vegan)
data (dune)

If you type head (dune), then you can see that this is a data frame with sites as rows and species as columns. For convenience sake, we can presume some field tech has potential to misidentify Poa pratensis and Poa trivialis.

poa = data.frame(Poaprat=dune$Poaprat,Poatriv=dune$Poatriv)
head(poa)
           Poaprat      Poatriv
1             4            2
2             4            7
3             5            6
4             4            5
5             2            6
6             3            4

What would be the best way to randomize the values between these two columns (transferring between each other and/or adding to one when both are present). The resulting data may look like:

           Poaprat      Poatriv
1             6            0
2             4            7
3             5            6
4             5            4
5             0            7
6             4            3

P.S.

For the cringing ecologist out there: please realize, I have made this example in the interest of time and that I know relative cover values are not additive. I apologize for needing to do that.

*** Edit: For more clarity, the type of data being randomized would be percent cover estimates (so values between 0% and 100%). The data in this quick example are relative cover estimates, not counts.

Upvotes: 0

Answers (3)

NDD

Reputation: 64

I just wanted to check in since accepting the answer from hrbrmstr. Given a little bit of time today, I went ahead and made a function that does this task with some degree of flexibility. It allows for inclusion of multiple species pairs, different probabilities between different species pairs (asymmetry in different direction), and includes explicitly the probability of the value staying the same.

misID = function(X, species,probs = c(0.1,0.1,0,0.8)){
library(purrr)

X2 = X

if (!is.matrix(species) == T){
as.matrix(species)
}


if (!is.matrix(probs) == T){
probs=matrix(probs,ncol=4,byrow=T)
}

if (nrow(probs) == 1){
probs = matrix(rep(probs[1,],nrow(species)),ncol=4,byrow=T)
}

for (i in 1:nrow(species)){

Spp = data.frame(X[species[i,1]],X[species[i,2]])


mis = map2_df(Spp[1],Spp[2],function(x,y) {
  for(n in 1:length(x)) {
    what = sample(c('left', 'right', 'swap','same'), size=1,prob=probs[i,])
    switch(
      what,
      left = {
        x[n] = x[n] + y[n]
        y[n] = 0
      },
      right = {
        y[n] = x[n] + y[n]
        x[n] = 0
      },
      swap = {
        tmp = y[n]
        y[n] = x[n]
        x[n] = tmp
      },
      same = {
      x[n] = x[n]
      y[n] = y[n]
      }
    )
}
misSpp = data.frame(x,y)
colnames(misSpp) =c(names(Spp[1]),names(Spp[2]))
return(misSpp)

})
X2[names(mis[1])] = mis[1]
X2[names(mis[2])] = mis[2]
}
return(X2)
}

There are probably a number of minor inefficiencies in here, but by and large it does what I need it to do. Sorry that there are no comments, but I did figure out how to handle getting the shuffled data into the data frame easily.

Thanks for pointing out the "purrr" package for me and also the switch function.

Example:

library(vegan)
library(labdsv)
data(dune)

#First convert relative abundances to my best guess at the % values in Van der Maarel (1979)
code = c(1,2,3,4,5,6,7,8,9)
value = c(0.1,1,2.5,4.25,5.5,20,40,60.5,90)
veg = vegtrans(dune,code,value)

specpairs = matrix(c("Poaprat","Poatriv","Trifprat","Trifrepe"),ncol=2,byrow=T) #create matrix of species pairs
probmat = matrix(c(0.3,0,0,0.7,0,0.5,0,0.5),ncol=4,byrow=T)                     #create matrix of misclassification probabilities

veg2 = misID(veg,specpairs,probs = probmat) 

print(veg2)

Upvotes: 0

R. Schifini

Reputation: 9313

Here is my approach:

Let's define a function that will take a number of specimens (n) and a probability (p) that it could be labeled incorrectly. This function will sample a 1 with probability p and a 0 with 1-p. The sum of this random sampling will give how many of the n specimens were incorrect.

mislabel = function(x, p){
    N_mis = sample(c(1,0), x, replace = T, prob = c(p, 1-p))
    sum(N_mis)
}

Once defined the function, apply it to each column and store it into two new columns

p_miss = 0.3

poa$Poaprat_mislabeled = sapply(poa$Poaprat, mislabel, p_miss)
poa$Poatriv_mislabeled = sapply(poa$Poatriv, mislabel, p_miss)

The final number of specimens tagged for each species can be calculated by substracting the incorrect from same species and adding the incorrect from the other specimen.

poa$Poaprat_final = poa$Poaprat - poa$Poaprat_mislabeled + poa$Poatriv_mislabeled
poa$Poatriv_final = poa$Poatriv - poa$Poatriv_mislabeled + poa$Poaprat_mislabeled

Result:

> head(poa)
  Poaprat Poatriv Poaprat_mislabeled Poatriv_mislabeled Poaprat_final Poatriv_final
1       4       2                  0                  0             4             2
2       4       7                  1                  2             5             6
3       5       6                  0                  3             8             3
4       4       5                  1                  2             5             4
5       2       6                  0                  3             5             3
6       3       4                  1                  2             4             3

Complete procedure:

mislabel = function(x, p){
    N_mis = sample(c(1,0), x, replace = T, prob = c(p, 1-p))
    sum(N_mis)
}


p_miss = 0.3

poa$Poaprat_mislabeled = sapply(poa$Poaprat, mislabel, p_miss)
poa$Poatriv_mislabeled = sapply(poa$Poatriv, mislabel, p_miss)

poa$Poaprat_final = poa$Poaprat - poa$Poaprat_mislabeled + poa$Poatriv_mislabeled
poa$Poatriv_final = poa$Poatriv - poa$Poatriv_mislabeled + poa$Poaprat_mislabeled

The p_miss variable is the probability of labeling incorrectly both species. You could also use a different value for each to simulate a non symmetrical chance that it may be easier to mislabel one of them compared to the other.

Upvotes: 0

hrbrmstr

Reputation: 78842

You'll still need to replace the actual columns with the new ones and there may be a more elegant way to do this (it's late in EDT land) and you'll have to decide what else besides the normal distribution you'll want to use (i.e. how you'll replace sample()) but you get your swaps and adds with:

library(vegan)
library(purrr)

data(dune)

poa <- data.frame(
  Poaprat=dune$Poaprat,
  Poatriv=dune$Poatriv
)

map2_df(poa$Poaprat, poa$Poatriv, function(x, y) {
  for (i in 1:length(x)) {
    what <- sample(c("left", "right", "swap"), 1)
    switch(
      what,
      left={ 
        x[i] <- x[i] + y[i]
        y[i] <- 0
      },
      right={ 
        y[i] <- x[i] + y[i]
        x[i] <- 0
      },
      swap={
        tmp <- y[i]
        y[i] <- x[i]
        x[i] <- tmp
      }
    )
  }
  data.frame(Poaprat=x, Poatriv=y)
})

Upvotes: 1

Randomize data between two columns in R

Answers (3)

Related Questions