Reputation: 477
This is a very specific question: I have a set of observations from a number of subjects across several years (at most one observation per subject per year). I want to select exactly one observation for each individual (I don't care from which year), in such a way that I end up with a similar number of observations per year, chosen as randomly as possible.
So, starting from df, in which 1 marks a year with an observation for that individual and 0 marks a year without one:
df <- data.frame(Ind = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"),
Year1 = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0),
Year2 = c(0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0),
Year3 = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1),
Year4 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1))
which, printed, looks like this:
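   Ind Year1 Year2 Year3 Year4
1    a     1     0     1     0
2    b     1     0     1     0
3    c     1     1     1     0
4    d     1     1     1     0
5    e     1     1     1     1
6    f     1     0     1     1
7    g     1     0     0     1
8    h     0     1     0     1
9    i     0     0     0     1
10   j     0     1     0     1
11   k     0     0     1     1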
I'd like to end up with something like this: exactly one selected observation (1) per individual, with the selections spread as evenly as possible across the four years.
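For example (just an illustration of one acceptable outcome, not the only valid one), a balanced pick could keep three individuals in each of Year1-Year3 and two in Year4:
   Ind Year1 Year2 Year3 Year4
1    a     1    NA    NA    NA
2    b    NA    NA     1    NA
3    c    NA     1    NA    NA
4    d    NA     1    NA    NA
5    e    NA    NA     1    NA
6    f     1    NA    NA    NA
7    g     1    NA    NA    NA
8    h    NA    NA    NA     1
9    i    NA    NA    NA     1
10   j    NA     1    NA    NA
11   k    NA    NA     1    NA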
EDIT: trying to apply solutions proposed (and failing)
(1) earch's answer:
df <- as_tibble(df)

year.weights <- df %>%
  gather(Year, Obs, -Ind) %>%
  group_by(Year) %>%
  summarize(wt = sum(Obs)) %>%
  ungroup()

df %>%
  gather(Year, Obs, -Ind) %>%
  filter(Obs == 1) %>%
  left_join(year.weights, by = "Year") %>%
  group_by(Ind) %>%
  sample_n(1, weight = 1 / wt) %>%
  select(-wt) %>%
  spread(Year, Obs) %>%
  ungroup()
This gives an error in the left_join step:
Error: 'by' can't contain join column 'Year' which is missing from RHS
I tried to work around it by giving the name "Year" to the only variable in the RHS:
names(year.weights) <- "Year"
But now this gives a new error: Error in left_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches) : Can't join on 'Year' x 'Year' because of incompatible types (numeric / character)
which actually makes a lot of sense, since the Year column in the LHS contains Year1, Year2, Year3, etc., while the renamed "Year" column in the RHS contains only the number 27.
This is as far as I got. I can't see exactly what earch was trying to accomplish, but I do believe a feasible solution can be built around sample_n() and its weight argument; I just can't quite see it yet.
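For what it's worth, a one-row year.weights with no Year column is exactly what you would see if another attached package (e.g. plyr) masks dplyr's summarize() and ignores the grouping. That is only a guess on my part, but forcing the dplyr version is a cheap thing to try:
library(dplyr)
library(tidyr)

# rebuild the per-year weights, calling dplyr::summarize() explicitly in case
# another package masks it and collapses everything into a single row
year.weights <- df %>%
  gather(Year, Obs, -Ind) %>%
  group_by(Year) %>%
  dplyr::summarize(wt = sum(Obs)) %>%
  ungroup()

year.weights  # should now have one row per Year plus a wt column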
(2) Mikey's answer:
This works well (I don't get an error as in the previous attempt), but it doesn't guarantee an equal (or similar) number of 1s in each Year column.
So, if I run the code a couple of times for testing, I get:
# first time
[,1] [,2] [,3] [,4]
[1,] 0 0 0 1
[2,] 1 0 0 0
[3,] 0 0 1 0
[4,] 0 1 0 0
[5,] 1 0 0 0
[6,] 0 0 1 0
[7,] 0 0 0 1
[8,] 0 1 0 0
[9,] 0 0 0 1
[10,] 0 0 0 1
[11,] 0 0 0 1
# second time
[,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 1 0 0 0
[3,] 0 0 1 0
[4,] 0 1 0 0
[5,] 0 0 0 1
[6,] 1 0 0 0
[7,] 1 0 0 0
[8,] 0 0 0 1
[9,] 0 0 0 1
[10,] 0 0 0 1
[11,] 0 0 1 0
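A quick way to quantify the imbalance is to sum each column of the returned matrix (assuming it is stored in an object, say out; the name is mine):
colSums(out)
# first run:  2 2 2 5
# second run: 4 1 2 4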
(3) Andre Elrico's answer:
It has the same problem as answer (2): it doesn't guarantee an equal number of 1s for each Year. See two random outputs:
# first try
Ind Year1 Year2 Year3 Year4
1 a NA NA NA 1
2 b NA NA 1 NA
3 c NA NA 1 NA
4 d NA 1 NA NA
5 e 1 NA NA NA
6 f NA NA 1 NA
7 g 1 NA NA NA
8 h NA NA NA 1
9 i NA NA NA 1
10 j NA NA NA 1
11 k NA NA 1 NA
# second try
Ind Year1 Year2 Year3 Year4
1 a 1 NA NA NA
2 b 1 NA NA NA
3 c NA NA 1 NA
4 d NA NA 1 NA
5 e NA 1 NA NA
6 f NA NA NA 1
7 g NA NA NA 1
8 h NA NA NA 1
9 i NA NA NA 1
10 j NA 1 NA NA
11 k NA NA 1 NA
(4) paoloeusebi's answer has the same problem as the previous ones: it does not guarantee an equal number of selected 1s per Year. In addition, the value kept for an individual can be a 0 rather than a 1:
# first try
Ind Year1 Year2 Year3 Year4
1 a 1 NA NA NA
2 b NA NA NA 0
3 c NA NA 1 NA
4 d NA NA NA 0
5 e NA NA 1 NA
6 f NA NA NA 1
7 g 1 NA NA NA
8 h NA NA 0 NA
9 i NA NA NA 1
10 j NA NA NA 1
11 k NA NA 1 NA
# second try
Ind Year1 Year2 Year3 Year4
1 a NA NA NA 1
2 b NA 0 NA NA
3 c NA 1 NA NA
4 d NA NA NA 0
5 e NA NA NA 1
6 f NA 0 NA NA
7 g NA 0 NA NA
8 h NA NA 0 NA
9 i NA NA 0 NA
10 j NA NA 0 NA
11 k NA 0 NA NA
Upvotes: 1
Views: 120
Reputation: 2188
Here's some code. Maybe not so elegant, but it's a start:
new_mat = function(df, max_iter = 100){
  ind_names <- df[, 1]
  df <- df[, -1]
  n = NROW(df)
  k = NCOL(df)
  max_col = ceiling(n / k)           # cap on how many 1s any single column may receive

  # safe version of sample() that also works when x has length 1
  resample = function(x, ...) x[sample.int(length(x), ...)]

  # length-n vector of 0s with a single 1 at position i
  one_hot = function(i, n){
    x = double(n)
    x[i] = 1
    return (x)
  }

  counter = 0
  flag = TRUE
  while (flag && counter <= max_iter){
    counter = counter + 1
    out = matrix(0, n, k)
    weights = rep(max_col, k)        # remaining capacity of each column
    index = sample(1:n)              # visit the rows in a random order
    c2 = 0
    for (i in index){
      ind = which(df[i, ] == 1)      # columns available for this row
      probs = weights[ind]
      if (max(probs) == 0)           # every available column is already full: restart
        break
      out[i, ] = one_hot(resample(ind, size = 1, prob = probs), k)
      weights = weights - out[i, ]   # use up one unit of the chosen column's capacity
      c2 = c2 + 1
    }
    if (c2 == length(index))         # every row got an assignment: done
      flag = FALSE
  }
  if (flag)
    stop('No matrix found. Try again.')
  final <- cbind(ind_names, as.data.frame(out))
  names(final) <- c("ind", names(df))
  return (final)
}
Requiring each column to have (roughly) the same number of occurrences poses quite a problem if you also want to randomize which column is selected. This is especially a problem when not all rows have observations present in each column. There is an implied dependence from row to row, which is probably not desirable.
Basically, this eventually sets a column's selection weight to zero once that column has reached max_col, the maximum number of occurrences beyond which the columns could no longer end up with roughly the same amount. (I borrowed earch's idea of weighting the columns.)
If a problem occurs (e.g. no column with weight > 0 can be selected for the next row), the process is re-run with a different random ordering of the rows, up to a maximum of max_iter attempts.
A major drawback of this is having to repeatedly loop through all the rows; I'm not sure of a way around that given your constraints, so with a really large data frame you might see long computation times. On the data set you provided, though, the function usually returns a result after a single pass, and rarely needs more than a few.
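For reference, a minimal usage sketch (the seed and the colSums() check are mine, not part of the answer itself):
set.seed(42)          # only for a reproducible draw
res <- new_mat(df)
res
colSums(res[, -1])    # no Year can exceed max_col = ceiling(11 / 4) = 3 selected 1s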
Upvotes: 1
Reputation: 176
If you wanted a random Year that was 1 for each individual, then here is a dplyr/tidyr method:
> df <- data.frame(Ind = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"),
+ Year1 = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0),
+ Year2 = c(0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0),
+ Year3 = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1),
+ Year4 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1))
>
> year.weights <- df %>%
+ gather(Year, Obs, -Ind) %>%
+ group_by(Year) %>%
+ summarize(wt = sum(Obs)) %>%
+ ungroup
>
> year.weights
# A tibble: 4 x 2
Year wt
<chr> <dbl>
1 Year1 7
2 Year2 5
3 Year3 7
4 Year4 7
>
>
> df %>%
+ gather(Year, Obs, -Ind) %>%
+ filter(Obs == 1) %>%
+ left_join(year.weights, by = "Year") %>%
+ group_by(Ind) %>%
+ sample_n(1, weight = 1 / wt) %>%
+ select(-wt) %>%
+ spread(Year, Obs) %>%
+ ungroup
# A tibble: 11 x 5
Ind Year1 Year2 Year3 Year4
<fct> <dbl> <dbl> <dbl> <dbl>
1 a 1 NA NA NA
2 b NA NA 1 NA
3 c NA 1 NA NA
4 d 1 NA NA NA
5 e NA NA 1 NA
6 f 1 NA NA NA
7 g NA NA NA 1
8 h NA NA NA 1
9 i NA NA NA 1
10 j NA 1 NA NA
11 k NA NA NA 1
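The weight = 1 / wt argument makes years that already have many observations less likely to be picked, which nudges (but does not force) the per-year counts toward an even split. Assuming the final pipe is saved in an object, say res (the name is mine), the counts of a draw can be checked with:
colSums(select(res, -Ind), na.rm = TRUE)
# for the draw shown above: Year1 = 3, Year2 = 2, Year3 = 2, Year4 = 4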
Upvotes: 2
Reputation: 11480
m <- df[-1]                                    # the Year columns only
IND <- rowSums(m) > 0                          # individuals with at least one observation
m[] <- NA
# max.col() with its default ties.method = "random" picks one of the row's 1s at random
m[cbind(which(IND), max.col(df[-1])[IND])] <- 1
cbind(df[1], m)
result:
# Ind Year1 Year2 Year3 Year4
#1 a 1 NA NA NA
#2 b NA NA 1 NA
#3 c NA NA 1 NA
#4 d NA NA 1 NA
#5 e NA NA 1 NA
#6 f 1 NA NA NA
#7 g NA NA NA 1
#8 h NA NA NA 1
#9 i NA NA NA 1
#10 j NA 1 NA NA
#11 k NA NA 1 NA
If you don't want to clutter the global environment with intermediate variables, simply:
(function(df){
  m <- df[-1]
  IND <- rowSums(m) > 0
  m[] <- NA
  m[cbind(which(IND), max.col(df[-1])[IND])] <- 1
  cbind(df[1], m)
})(df) # run this n times
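If several independent draws are needed, one option (a sketch; pick_one is simply a name given here to the same function) is:
pick_one <- function(df){
  m <- df[-1]
  IND <- rowSums(m) > 0
  m[] <- NA
  m[cbind(which(IND), max.col(df[-1])[IND])] <- 1
  cbind(df[1], m)
}

draws <- replicate(5, pick_one(df), simplify = FALSE)  # a list of 5 independent selections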
Upvotes: 0
Reputation: 1086
Here is a solution that, for each subject, replaces 3 of the 4 year columns with NA at random:
for (i in 1:dim(df)[1]) {
  df[i, sample(2:5, 3)] <- NA   # blank three of the four Year columns (columns 2 to 5) at random
}
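As a side note, a small variant of this loop that only ever keeps a year in which the subject actually has an observation (my tweak, not part of the answer above; it assumes every subject has at least one 1 and is run on a fresh copy of df) could be:
for (i in seq_len(nrow(df))) {
  ones <- which(df[i, -1] == 1) + 1                          # column indices of observed years (offset past Ind)
  keep <- if (length(ones) == 1) ones else sample(ones, 1)   # guard: sample(x, 1) misbehaves when x has length 1
  df[i, setdiff(2:5, keep)] <- NA                            # blank every other Year column
}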
Upvotes: 0