Reputation: 286
I'm trying to sample from a data frame but with the condition that the sample represents the distribution in terms of a certain criterion in my case. The data frame is structured like this:
df <- data.frame(Locaton = c('A', 'B', 'B', 'B', 'C', 'C', ...),
Veg_Species = c('X', 'Y', 'Z', 'Z', 'Z', 'Z', ...),
Date_Diff = c(2, 5, 2, 0, 4, 4, ...))
It is important to know that the number of each Veg_Species
differs. That means X
has 25 occurrences, Y
45 and Z
78 for example. And now I want to sample from the different Veg_Species
based on the distribution in terms of Date_Diff
of the smallest sample. In that case that would mean sampling from every species in terms of Date_diff
distribution from X
.
I thought that I can do that with dplyr
:
sample.species <- df %>%
filter(Veg_Species == 'Z') %>%
sample_n(25, replace = TRUE)
But that obviously only samples randomly from all Veg_Species
with the name Z
.
How can I take the distribution into account too?
For a more detailed example, click here.
Upvotes: 2
Views: 1172
Reputation: 35554
Maybe you can try kernel density estimation for the distribution of Date_Diff
.
1. Data and package
df <- read.csv("http://www.sharecsv.com/dl/2a26bf2c69bfd76e8ddcecd1c3739a31/ex.csv", row.names = 1)
library(dplyr)
2. Find the smallest species
df %>% count(Species)
# Species n
# 1 Adenostoma fasciculatum 95
# 2 Artemisia filifolia 26
# 3 Eriogonum fasciculatum 41
# 4 Tamarix L. 27
3. Kernel density estimation of distribution and linear interpolation
(Reference: https://stats.stackexchange.com/a/78775/218516)
val <- df$Date_Diff[df$Species == "Artemisia filifolia"]
dist.fun <- approxfun(density(val))
4. Sampling
(sample_n()
has been superseded in favour of slice_sample()
since dplyr 1.0.0
.)
df2 <- df %>%
group_by(Species) %>%
slice_sample(n = 26, weight_by = dist.fun(Date_Diff)) %>%
ungroup()
5. Check
df2 %>% count(Species)
# Species n
# <chr> <int>
# 1 Adenostoma fasciculatum 26
# 2 Artemisia filifolia 26
# 3 Eriogonum fasciculatum 26
# 4 Tamarix L. 26
Upvotes: 5
Reputation: 9865
The argument prob=
in sample()
is a vector of weights for each element of sample. My idea is to use the indexes for each row and the vector of weights for the sampling. That would preserve the distribution.
sample_by_distribution <- function(df, dist_weights_col, n, replace=FALSE) {
sampled_indexes <- sample(x=1:nrow(df), size=n, replace=replace, prob = df[, dist_weights_col])
df[sampled_indexes,]
}
Sample by taking into account the distribution weights in your case:
sample_df <- sample_by_distribution(df, "Date_Diff", 25, replace=FALSE)
This would sample 25 rows of df
while the probability for each row follows the "Date_Diff" column. Therefore, the distribution of "Veg_Species" should be also preserved.
Upvotes: 0
Reputation: 5747
It looks to me like you want to sample over your data set, but maintain the distribution of Date_diff
that is present in the X subset.
First you need to determine what is present in the X subset. I made some fake data that seems to look like yours:
set.seed(123)
df <- data.frame(Location = sample(LETTERS[1:3], 148, replace = TRUE),
Veg_Species = c(rep("X", 25), rep("Y", 45), rep("Z", 78)),
Date_Diff = trunc(runif(148, 0, 10)))
Now, we need the distribution of Date_Diff
for Veg_Species = X
. We can do that with dplyr
:
library(dplyr)
x_dist <- df %>%
filter(Veg_Species == "X") %>%
group_by(Date_Diff) %>%
summarize(count = n())
x_dist
A tibble: 8 x 2
Date_Diff count
<dbl> <int>
1 1 2
2 2 6
3 3 5
4 4 3
5 5 3
6 6 2
7 7 2
8 8 2
Now we filter the original data, nest_by(Date_Diff)
and sample each data
by the count
in x_dist
.
set.seed(345)
df_sample <- df %>%
semi_join(x_dist) %>% # Remove all rows with Date_Diff not in x_dist
nest_by(Date_Diff) %>%
inner_join(x_dist) %>%
mutate(data = list(data[sample(1:nrow(data), # sampling the data
size = count,
replace = TRUE),])) %>%
summarize(data) %>% # unnesting the data
select(Location, veg_Species,
Date_Diff, -count) # reordering columns and removing count
df_sample
# A tibble: 25 x 3
# Groups: Date_Diff [8]
Location Veg_Species Date_Diff
<chr> <chr> <dbl>
1 C Z 1
2 A Z 1
3 A Y 2
4 C Z 2
5 B X 2
6 B Z 2
7 B X 2
8 B X 2
9 A Y 3
10 A X 3
# ... with 15 more rows
Upvotes: 1