Corbjn
Corbjn

Reputation: 286

Sample from dataframe by groups based on a certain distribution

I'm trying to sample from a data frame but with the condition that the sample represents the distribution in terms of a certain criterion in my case. The data frame is structured like this:

df <- data.frame(Locaton = c('A', 'B', 'B', 'B', 'C', 'C', ...),
                 Veg_Species = c('X', 'Y', 'Z', 'Z', 'Z', 'Z', ...),
                 Date_Diff = c(2, 5, 2, 0, 4, 4, ...))

It is important to know that the number of each Veg_Species differs. That means X has 25 occurrences, Y 45 and Z 78 for example. And now I want to sample from the different Veg_Species based on the distribution in terms of Date_Diff of the smallest sample. In that case that would mean sampling from every species in terms of Date_diff distribution from X.

I thought that I can do that with dplyr:

sample.species <- df %>%
  filter(Veg_Species == 'Z') %>%
  sample_n(25, replace = TRUE)

But that obviously only samples randomly from all Veg_Species with the name Z.

How can I take the distribution into account too?

For a more detailed example, click here.

Upvotes: 2

Views: 1172

Answers (3)

Darren Tsai
Darren Tsai

Reputation: 35554

Maybe you can try kernel density estimation for the distribution of Date_Diff.

1. Data and package

df <- read.csv("http://www.sharecsv.com/dl/2a26bf2c69bfd76e8ddcecd1c3739a31/ex.csv", row.names = 1)
library(dplyr)

2. Find the smallest species

df %>% count(Species)

#                   Species  n
# 1 Adenostoma fasciculatum 95
# 2     Artemisia filifolia 26
# 3  Eriogonum fasciculatum 41
# 4              Tamarix L. 27

3. Kernel density estimation of distribution and linear interpolation

(Reference: https://stats.stackexchange.com/a/78775/218516)

val <- df$Date_Diff[df$Species == "Artemisia filifolia"]
dist.fun <- approxfun(density(val))

4. Sampling

(sample_n() has been superseded in favour of slice_sample() since dplyr 1.0.0.)

df2 <- df %>%
  group_by(Species) %>% 
  slice_sample(n = 26, weight_by = dist.fun(Date_Diff)) %>%
  ungroup()

5. Check

df2 %>% count(Species)

#   Species                     n
#   <chr>                   <int>
# 1 Adenostoma fasciculatum    26
# 2 Artemisia filifolia        26
# 3 Eriogonum fasciculatum     26
# 4 Tamarix L.                 26

Upvotes: 5

Gwang-Jin Kim
Gwang-Jin Kim

Reputation: 9865

The argument prob= in sample() is a vector of weights for each element of sample. My idea is to use the indexes for each row and the vector of weights for the sampling. That would preserve the distribution.

sample_by_distribution <- function(df, dist_weights_col, n, replace=FALSE) {
  sampled_indexes <- sample(x=1:nrow(df), size=n, replace=replace, prob = df[, dist_weights_col])
  df[sampled_indexes,]
}

Sample by taking into account the distribution weights in your case:

sample_df <- sample_by_distribution(df, "Date_Diff", 25, replace=FALSE)

This would sample 25 rows of df while the probability for each row follows the "Date_Diff" column. Therefore, the distribution of "Veg_Species" should be also preserved.

Upvotes: 0

Ben Norris
Ben Norris

Reputation: 5747

It looks to me like you want to sample over your data set, but maintain the distribution of Date_diff that is present in the X subset.

First you need to determine what is present in the X subset. I made some fake data that seems to look like yours:

set.seed(123)
df <- data.frame(Location = sample(LETTERS[1:3], 148, replace = TRUE),
                 Veg_Species = c(rep("X", 25), rep("Y", 45), rep("Z", 78)),
                 Date_Diff = trunc(runif(148, 0, 10)))

Now, we need the distribution of Date_Diff for Veg_Species = X. We can do that with dplyr:

library(dplyr)
x_dist <- df %>%
  filter(Veg_Species == "X") %>%
  group_by(Date_Diff) %>%
  summarize(count = n())
x_dist
A tibble: 8 x 2
  Date_Diff count
      <dbl> <int>
1         1     2
2         2     6
3         3     5
4         4     3
5         5     3
6         6     2
7         7     2
8         8     2

Now we filter the original data, nest_by(Date_Diff) and sample each data by the count in x_dist.

set.seed(345)
df_sample <- df %>%
  semi_join(x_dist) %>%  # Remove all rows with Date_Diff not in x_dist
  nest_by(Date_Diff) %>%
  inner_join(x_dist) %>% 
  mutate(data = list(data[sample(1:nrow(data), # sampling the data
                                 size = count, 
                                 replace = TRUE),])) %>%
  summarize(data) %>%    # unnesting the data
  select(Location, veg_Species, 
         Date_Diff, -count) # reordering columns and removing count
df_sample
# A tibble: 25 x 3
# Groups:   Date_Diff [8]
   Location Veg_Species Date_Diff
   <chr>    <chr>           <dbl>
 1 C        Z                   1
 2 A        Z                   1
 3 A        Y                   2
 4 C        Z                   2
 5 B        X                   2
 6 B        Z                   2
 7 B        X                   2
 8 B        X                   2
 9 A        Y                   3
10 A        X                   3
# ... with 15 more rows

Upvotes: 1

Related Questions