Reputation: 2575
I have a data like this, where Average is the average of X, Y, and Z.
head(df)
ID X Y Z Average
A 2 2 5 3
A 4 3 2 3
A 4 3 2 3
B 5 3 1 3
B 3 4 2 3
B 1 5 3 3
C 5 3 1 3
C 2 3 4 3
C 5 3 1 3
D 2 3 4 3
D 3 2 4 3
D 3 2 4 3
E 5 3 1 3
E 4 3 2 3
E 3 4 2 3
To reproduce this, we can use
df <- data.frame(ID = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D", "E", "E", "E"),
X = c(2L, 4L, 4L, 5L, 3L,1L, 5L, 2L, 5L, 2L, 3L, 3L, 5L, 4L, 3L),
Y = c(2L, 3L, 3L, 3L,4L, 5L, 3L, 3L, 3L, 3L, 2L, 2L, 3L, 3L, 4L),
Z = c(5L, 2L, 2L,1L, 2L, 3L, 1L, 4L, 1L, 4L, 4L, 4L, 1L, 2L, 2L),
Average = c(3L,3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L))
From this, I want to extract one observation per ID such that we don't get same (as much as is possible) values of the combination of X, Y, and Z. I tried
library(dplyr)
df %>% sample_n(size = nrow(.), replace = FALSE) %>% distinct(ID, .keep_all = T)
But, on a larger dataset, I see too many repetitions of the combination of X, Y, Z. To the extent possible, I need the output with equal or close to equal representation of cases (i.e. the combination of X, Y, Y) like this:
ID X Y Z Average
A 2 2 5 3
B 5 3 1 3
C 2 3 4 3
D 3 2 4 3
E 4 3 2 3
Upvotes: 1
Views: 3183
Reputation: 511
I just want to add to the answer from lefft. I modified the code to allow sample sizes and not dependent on having a column called ID, or X,Y,Z, or any of that
id_sample <- function(df, id, field, sampleSize) {
df = df %>%
filter(!!as.symbol(field) == id)
return(df[sample(1:nrow(df), size = sampleSize, replace = TRUE),])
}
make_sample_df <- function(dat, sampleSize, field) {
# initialize empty data frame to fill with row samples
out <- dat[NULL,]
# get the unique id's to loop over
ids <- unique(dat[[field]])
for (id in ids) {
# grab a sample
id_row <- id_sample(dat, id, field,sampleSize)
out <- rbind(out, id_row)
}
return(out)
}
And then how I use this is
sample_df = make_sample_df(df, 20, "ColumnToSampleOn")
where df is my original dataframe, 20 is the count of rows I want for each unique value found in the "ColumnToSampleOn"
Upvotes: 1
Reputation: 145755
This seems dubious, but try this:
library(dplyr)
df %>% add_count(X, Y, Z) %>%
group_by(ID) %>%
top_n(-1, n) %>%
arrange(runif(n)) %>%
select(-n) %>%
slice(1)
# # A tibble: 5 x 5
# # Groups: ID [5]
# ID X Y Z Average
# <fctr> <int> <int> <int> <int>
# 1 A 2 2 5 3
# 2 B 1 5 3 3
# 3 C 2 3 4 3
# 4 D 3 2 4 3
# 5 E 3 4 2 3
This picks the least common XYZ combo from each ID, and in case of a tie chooses randomly. Extremely common XYZ combos may be missing entirely.
Upvotes: 2
Reputation: 2105
Here's one way to do it. Explanations inline. Note that the implementation can be tweaked depending on the desired notion of "similarity"/"duplication" between rows.
# get the sample data from the original post
dat <- data.frame(
ID = c("A","A","A","B", "B", "B", "C", "C", "C", "D", "D", "D", "E", "E", "E"),
X = c(2L, 4L, 4L, 5L, 3L,1L, 5L, 2L, 5L, 2L, 3L, 3L, 5L, 4L, 3L),
Y = c(2L, 3L, 3L, 3L,4L, 5L, 3L, 3L, 3L, 3L, 2L, 2L, 3L, 3L, 4L),
Z = c(5L, 2L, 2L,1L, 2L, 3L, 1L, 4L, 1L, 4L, 4L, 4L, 1L, 2L, 2L),
Average = c(3L,3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L))
Define a function that samples a row of a data frame given an id (assumes $ID
column exists):
# function to get a randomly sampled row from `df` with `df$ID==id`
id_sample <- function(df, id){
df <- df[df$ID==id, ]
return(df[sample(1:nrow(df), size=1), ])
}
Define a function that will loop over each id, rejecting a sample if it is "too similar" to the rows already in the data frame:
make_sample_df <- function(dat, threshold){
# initialize empty data frame to fill with row samples
out <- dat[NULL, ]
# get the unique id's to loop over
ids <- unique(dat$ID)
for (id in ids){
# grab a sample
id_row <- id_sample(dat, id)
# see how many of its elements have column-duplicates (zero for first id)
n_dupes <- sum(apply(out, 1, function(row){
sum(row[1]==id_row$X, row[2]==id_row$Y, row[3]==id_row$Z)}))
# keep getting samps if the number of duplicates is higher than threshold
while (n_dupes > threshold){
id_row <- id_sample(dat, id)
n_dupes <- sum(apply(out, 1, function(row){
sum(row[1]==id_row$X, row[2]==id_row$Y, row[3]==id_row$Z)}))
}
# once we have a suitable row for `id`, add it to the output df
out <- rbind(out, id_row)
}
return(out)
}
Now apply the func to OP's data and inspect:
# rows at most 1 of whose values appear in another row (at same column)
set.seed(6933)
make_sample_df(dat, threshold=1)
## ID X Y Z Average
## A 4 3 2 3
## B 1 5 3 3
## C 2 3 4 3
## D 3 2 4 3
## E 5 3 1 3
Depending on your needs, you can try different threshold
values, though note that if you're too strict about threshold
, the while
loop could keep running forever, so you might want to put some kind of escape hatch in there.
You can also adapt this strategy to different variations, e.g. where you care about values getting duplicated within rows, instead of columns.
Hope this helps ~~
Upvotes: 1
Reputation: 20085
One possible solution could be by excluding
values which has been sampled for previous column.
library(dplyr)
df %>% group_by(ID) %>%
summarise(XX = sample(unique(X),1),
YY = sample(unique(Y[Y!=XX]),1),
ZZ = sample(unique(Z[Z!=XX & Z!=YY]),1),
Average = mean(Average))
# # A tibble: 5 x 5
# ID XX YY ZZ Average
# <fctr> <int> <int> <int> <dbl>
# 1 A 4 2 1 3.00
# 2 B 5 4 2 3.00
# 3 C 5 3 1 3.00
# 4 D 2 3 2 3.00
# 5 E 5 4 2 3.00
I have changed column names just make the logic clear.
Upvotes: 0