abeckers

Reputation: 1

Divide data into multiple groups as similar as possible

I have a data frame in the format

> head(daten_strat)
   id age gender anxiety
1   7  40      2       7
2   3  53      1       8
3   4  40      1       4
4   1  62      2       8
5   5  60      2      11
6   6  45      1       8

I would like to create 4 random groups that are as similar as possible in terms of the distribution of gender, age and anxiety.

For a university course, we are planning an intervention with 4 different conditions. To assign the participants to the 4 conditions, I would like to use R to perform a stratified randomization. The final result should be 4 groups that are as similar as possible in terms of age, gender, and level of anxiety, so that (somewhat simplified) differences in effectiveness cannot be attributed to demographic differences between the groups.

Upvotes: 0

Views: 467

Answers (1)

shs

Reputation: 3901

I would not call this task stratified sampling, since you are not trying to get a representative sample of a population. What you are looking to do is partitioning. The anticlust package, with its anticlustering() function, provides a number of methods for this task. I'll show a basic example with the defaults below. You might want to look into the methods more deeply if you want to use the partitioning for research purposes.

library(tidyverse)
library(anticlust)
set.seed(42)

# Example data
dat <- tibble(
  id = as.character(1:100),
  age = rnorm(100, 50, 10) |> round(),
  gender = sample(1:2, 100, replace = TRUE),
  anxiety = rnorm(100, 7.5, 2.25) |> round()
)

dat <- dat |> 
  mutate(group = anticlustering(dat[, -1], K = 4)) # Basic usage with defaults 
dat
#> # A tibble: 100 × 5
#>    id      age gender anxiety group
#>    <chr> <dbl>  <int>   <dbl> <dbl>
#>  1 1        64      2       7     2
#>  2 2        44      2       4     1
#>  3 3        54      1      10     4
#>  4 4        56      2       7     3
#>  5 5        54      1       6     3
#>  6 6        49      1       5     3
#>  7 7        65      2       7     3
#>  8 8        49      2       6     2
#>  9 9        70      2       6     1
#> 10 10       49      2      10     2
#> # … with 90 more rows

As the group means below show, the groups differ very little on any of the variables.

# Means across groups
dat |> 
  group_by(group) |> 
  summarize(across(age:anxiety, mean))
#> # A tibble: 4 × 4
#>   group   age gender anxiety
#>   <dbl> <dbl>  <dbl>   <dbl>
#> 1     1  50.3   1.48    7.48
#> 2     2  50.2   1.44    7.52
#> 3     3  50.5   1.44    7.4 
#> 4     4  50.2   1.44    7.44
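
The basic call above treats gender as just another numeric column. As a possible refinement, here is a minimal sketch that instead passes gender through the `categories` argument of anticlustering(), so that each gender is spread as evenly as possible across the groups, and sets `standardize = TRUE` so that age and anxiety contribute on a comparable scale. It assumes an anticlust version that supports both arguments, and it uses simulated data in base R rather than your real data frame.

```r
library(anticlust)
set.seed(42)

# Simulated data with the same columns as in the question (assumption: not real data)
dat2 <- data.frame(
  id = as.character(1:100),
  age = round(rnorm(100, 50, 10)),
  gender = sample(1:2, 100, replace = TRUE),
  anxiety = round(rnorm(100, 7.5, 2.25))
)

# Balance the numeric variables while holding gender counts even across groups
dat2$group <- anticlustering(
  dat2[, c("age", "anxiety")],  # numeric features to make similar across groups
  K = 4,                        # number of groups
  categories = dat2$gender,     # categorical variable to distribute evenly
  standardize = TRUE            # put age and anxiety on a common scale first
)

# Gender counts per group
gender_tab <- table(dat2$group, dat2$gender)
gender_tab

# Group means of the numeric variables
aggregate(cbind(age, anxiety) ~ group, data = dat2, FUN = mean)
```

The cross-table makes it easy to check that no group is over- or under-represented on gender, which a comparison of means alone can hide.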

Upvotes: 1
