rg255

Reputation: 4169

Randomise blocks of values

I have a dataset with several variables (measurements of length, width, etc.) and a grouping variable, line. I want to randomise the data so that the between-line variance of each variable is preserved, but the among-line covariances between variables are broken.

Using iris as an example, with Species as the grouping variable, I can keep each species' block of values together and reassign it to a new species independently for each trait, which is exactly what I want. However, the result also contains NAs. How can I collapse it so that it has the same shape as the original data?

library(data.table)
set.seed(21)

dtIris <- data.table(id = 1:9, iris[c(1:3, 51:53, 101:103), ])

dtIris 

dcast(
  melt(dtIris, id.vars = c('id', 'Species'))[
    melt(dtIris, id.vars = c('id', 'Species'))[, 
      .('Species' = unique(Species), 'new' = sample(unique(Species))), by = variable], 
    on = c('Species', 'variable')][, -c('Species')], 
  ... ~ variable, value.var = 'value')

This puts the data in long format, samples a permutation of the unique species for each trait, joins that back onto the long data, then spreads it back to wide format. It leaves NAs where new != Species.

    id        new Sepal.Length Sepal.Width Petal.Length Petal.Width
 1:  1     setosa           NA         3.5          1.4          NA
 2:  1  virginica          5.1          NA           NA         0.2
 3:  2     setosa           NA         3.0          1.4          NA
...

Upvotes: 0

Views: 46

Answers (1)

Ronak Shah

Reputation: 388982

This is too long for a comment, so I am posting it as an answer. You get NAs because you sampled new species separately for each variable, so the relationship between id and your new variable is no longer one-to-one.

library(data.table)

dt1 <- melt(dtIris, id.vars = c('id', 'Species'))

dt2 <- dt1[dt1[,.('Species' = unique(Species), 'new' = sample(unique(Species))), 
                by = variable], on = c('Species', 'variable')]

Previously, what you had was:

table(dt2$Species, dt2$id)

#             1 2 3 4 5 6 7 8 9
#  setosa     4 4 4 0 0 0 0 0 0
#  versicolor 0 0 0 4 4 4 0 0 0
#  virginica  0 0 0 0 0 0 4 4 4

and now what you have is :

table(dt2$new, dt2$id)
#             1 2 3 4 5 6 7 8 9
#  setosa     1 1 1 0 0 0 3 3 3
#  versicolor 3 3 3 0 0 0 1 1 1
#  virginica  0 0 0 4 4 4 0 0 0

As you can see, every id previously belonged to exactly one Species, but after sampling that is no longer true (id = 1 now has both "setosa" and "versicolor", whereas previously it had only "setosa"). dcast cannot put id = 1 on a single row when its values of new differ across variables, so it creates one row per id/new combination and fills the missing cells with NA.
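
One possible way around this, sketched here under the assumption that every line (Species in this example) has the same number of rows, is to skip the melt/dcast round trip and permute the blocks of values in place, one trait at a time. The names traits, shuffled, blocks and permuted are only illustrative, and this is not necessarily the only or best approach:

```r
library(data.table)
set.seed(21)

dtIris <- data.table(id = 1:9, iris[c(1:3, 51:53, 101:103), ])

# Assumption: every line (here Species) has the same number of rows,
# so whole blocks can be swapped between lines.
stopifnot(length(unique(table(dtIris$Species))) == 1)

traits   <- setdiff(names(dtIris), c("id", "Species"))
shuffled <- copy(dtIris)

for (v in traits) {
  # One block of values per species, in factor-level order
  blocks <- split(dtIris[[v]], dtIris$Species)
  # Permute the blocks independently for this trait
  permuted <- unname(blocks[sample(length(blocks))])
  # unsplit() writes the i-th block back into the rows of the i-th species
  set(shuffled, j = v, value = unsplit(permuted, dtIris$Species))
}

shuffled
```

Because each trait is overwritten column by column, the result keeps the original 9 rows and 6 columns and contains no NAs. With unequal group sizes the blocks cannot be swapped one-for-one, so the stopifnot() check would need to be replaced by some other rule for matching blocks to lines.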

Upvotes: 1
