Virginia Morera Pujol
Virginia Morera Pujol

Reputation: 477

Selecting unique values without repeating colums

This is a very specific question: I have a set of observations, from a number of subjects, along several years (only one observation per year). I want to select only one observation of each individual, (I don't care from which year) in a way that I end up with a similar number of observations per year, and as random as possible.

So, starting from df in which 1 are years there is an observation for that individual, and 0 are years where there's no observation for that individual:

df <- data.frame(Ind   = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"),
             Year1 = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0), 
             Year2 = c(0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0), 
             Year3 = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1), 
             Year4 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1))

which looks like

df example View

I'd like to end up with something like this

df2 example View

EDIT: trying to apply solutions proposed (and failing)

(1) earch's answer:

df <- as_tibble(df)

year.weights <- df %>% 
  gather(Year, Obs, -Ind) %>% 
  group_by(Year) %>% 
  summarize(wt = sum(Obs)) %>% 
  ungroup


df %>% 
      gather(Year, Obs, -Ind) %>%
      filter(Obs == 1) %>% 
      left_join(year.weights, by = "Year") %>% 
      group_by(Ind) %>% 
      sample_n(1, weight = 1 / wt) %>% 
      select(-wt) %>% 
      spread(Year, Obs) %>% 
      ungroup

This gives an error Error: 'by' can't contain join column 'Year' which is missing from RHS which appears in the left_join step. I try to solve this by giving the name "Year" to the only variable in RHS

names(year.weights) <- "Year"

But now this gives a new error: Error in left_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches) : Can't join on 'Year' x 'Year' because of incompatible types (numeric / character) which actually makes a lot of sense as Year column in LHS contains Year1, Year2, Year3, etc. while Year column in RHS contains the number 27.

This is as far as I got because I can't see what earch was trying to accomplish, but I do believe a feasible solution can be achieved through this n_sample and the weight argument, but I can't quite see it yet.

(2) Mikey's answer:

This works well (I don't get the error I was getting before) but it doesn't guarantee that I'll get an equal (or similar) number of 1s for each "Year" column.

So, if I run the code a couple of times for testing, I get:

# first time
      [,1] [,2] [,3] [,4]
 [1,]    0    0    0    1
 [2,]    1    0    0    0
 [3,]    0    0    1    0
 [4,]    0    1    0    0
 [5,]    1    0    0    0
 [6,]    0    0    1    0
 [7,]    0    0    0    1
 [8,]    0    1    0    0
 [9,]    0    0    0    1
[10,]    0    0    0    1
[11,]    0    0    0    1

# second time
      [,1] [,2] [,3] [,4]
 [1,]    1    0    0    0
 [2,]    1    0    0    0
 [3,]    0    0    1    0
 [4,]    0    1    0    0
 [5,]    0    0    0    1
 [6,]    1    0    0    0
 [7,]    1    0    0    0
 [8,]    0    0    0    1
 [9,]    0    0    0    1
[10,]    0    0    0    1
[11,]    0    0    1    0

(3) Andre Elrico's answer:

It has the same problem as answer (2), it doesn't guarantee an equal number of 1s for each Year: see two random outputs:

# fist try
   Ind Year1 Year2 Year3 Year4
1    a    NA    NA    NA     1
2    b    NA    NA     1    NA
3    c    NA    NA     1    NA
4    d    NA     1    NA    NA
5    e     1    NA    NA    NA
6    f    NA    NA     1    NA
7    g     1    NA    NA    NA
8    h    NA    NA    NA     1
9    i    NA    NA    NA     1
10   j    NA    NA    NA     1
11   k    NA    NA     1    NA

# second try
   Ind Year1 Year2 Year3 Year4
1    a     1    NA    NA    NA
2    b     1    NA    NA    NA
3    c    NA    NA     1    NA
4    d    NA    NA     1    NA
5    e    NA     1    NA    NA
6    f    NA    NA    NA     1
7    g    NA    NA    NA     1
8    h    NA    NA    NA     1
9    i    NA    NA    NA     1
10   j    NA     1    NA    NA
11   k    NA    NA     1    NA

(4) paoloeusebi's answer has the same problem as the previous ones. Not guaranteeing equal number of selected 1s per row:

# first try
   Ind Year1 Year2 Year3 Year4
1    a     1    NA    NA    NA
2    b    NA    NA    NA     0
3    c    NA    NA     1    NA
4    d    NA    NA    NA     0
5    e    NA    NA     1    NA
6    f    NA    NA    NA     1
7    g     1    NA    NA    NA
8    h    NA    NA     0    NA
9    i    NA    NA    NA     1
10   j    NA    NA    NA     1
11   k    NA    NA     1    NA

# second try
   Ind Year1 Year2 Year3 Year4
1    a    NA    NA    NA     1
2    b    NA     0    NA    NA
3    c    NA     1    NA    NA
4    d    NA    NA    NA     0
5    e    NA    NA    NA     1
6    f    NA     0    NA    NA
7    g    NA     0    NA    NA
8    h    NA    NA     0    NA
9    i    NA    NA     0    NA
10   j    NA    NA     0    NA
11   k    NA     0    NA    NA

Upvotes: 1

Views: 120

Answers (4)

mickey
mickey

Reputation: 2188

Here's some code. Maybe not so elegant, but it's a start:

new_mat = function(df, max_iter = 100){
    ind_names <- df[,1]
    df <- df[,-1]
    n = NROW(df)
    k = NCOL(df)
    max_col = ceiling(n / k)
    resample = function(x, ...) x[sample.int(length(x), ...)]
    one_hot = function(i, n){
        x = double(n)
        x[i] = 1
        return (x)
        }
    counter = 0
    flag = TRUE
    while (flag && counter <= max_iter){
        counter = counter + 1
        out = matrix(0, n, k)
        weights = rep(max_col, k)
        index = sample(1:n)
        c2 = 0
        for (i in index){
            ind = which(df[i,] == 1)
            probs = weights[ind]
            if (max(probs) == 0)
                break
            out[i,] = one_hot(resample(ind, size = 1, prob = probs), k)
            weights = weights - out[i,]
            c2 = c2 + 1
            }
        if (c2 == length(index))
            flag = FALSE
        }
    if (flag)
        stop('No matrix found. Try again.')
    final <- cbind(ind_names, as.data.frame(out))
    names(final) <- c("ind", names(df))
    return (final)
    }

Requiring each column to have (roughly) the same number of occurrences poses quite a problem if you also want to randomize which column is selected. This is especially a problem when not all rows have observations present in each column. There is an implied dependence from row to row, which is probably not desirable.

Basically, this eventually sets a column's weight of being selected down to zero, once that column as reached max_col, or the maximum number of occurrences after which the columns cannot have roughly the same amount. (I borrowed earch's idea of weighting the columns.)

If a problem occurs (e.g. no column can be selected for the next row which has weight>0), then the process is re-run, up to a maximum of max_iter, but going through a different order of the rows.

A major drawback of this is having to repeatedly loop through all the rows. I'm not sure of a way around this given your constraints. So if you have a really large data frame, you might see some long computation times. But on the data set you provided, the function return a matrix usually after just one pass, no more than a few.

Upvotes: 1

Gramposity
Gramposity

Reputation: 176

If you wanted a random Year that was 1 for each individual, then here is a dplyr/tidyr method:

> df <- data.frame(Ind   = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"),
+                  Year1 = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0), 
+                  Year2 = c(0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0), 
+                  Year3 = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1), 
+                  Year4 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1))
> 
> year.weights <- df %>% 
+   gather(Year, Obs, -Ind) %>% 
+   group_by(Year) %>% 
+   summarize(wt = sum(Obs)) %>% 
+   ungroup
> 
> year.weights
# A tibble: 4 x 2
  Year     wt
  <chr> <dbl>
1 Year1     7
2 Year2     5
3 Year3     7
4 Year4     7
> 
> 
> df %>% 
+   gather(Year, Obs, -Ind) %>%
+   filter(Obs == 1) %>% 
+   left_join(year.weights, by = "Year") %>% 
+   group_by(Ind) %>% 
+   sample_n(1, weight = 1 / wt) %>% 
+   select(-wt) %>% 
+   spread(Year, Obs) %>% 
+   ungroup
# A tibble: 11 x 5
   Ind   Year1 Year2 Year3 Year4
   <fct> <dbl> <dbl> <dbl> <dbl>
 1 a         1    NA    NA    NA
 2 b        NA    NA     1    NA
 3 c        NA     1    NA    NA
 4 d         1    NA    NA    NA
 5 e        NA    NA     1    NA
 6 f         1    NA    NA    NA
 7 g        NA    NA    NA     1
 8 h        NA    NA    NA     1
 9 i        NA    NA    NA     1
10 j        NA     1    NA    NA
11 k        NA    NA    NA     1

Upvotes: 2

Andre Elrico
Andre Elrico

Reputation: 11480

m   <- df[-1]
IND <- rowSums(m) > 0
m[] <- NA
m[cbind(which(IND),max.col(df[-1])[IND])] <- 1
cbind(df[1],m)

result:

#   Ind Year1 Year2 Year3 Year4
#1    a     1    NA    NA    NA
#2    b    NA    NA     1    NA
#3    c    NA    NA     1    NA
#4    d    NA    NA     1    NA
#5    e    NA    NA     1    NA
#6    f     1    NA    NA    NA
#7    g    NA    NA    NA     1
#8    h    NA    NA    NA     1
#9    i    NA    NA    NA     1
#10   j    NA     1    NA    NA
#11   k    NA    NA     1    NA

if you don't like to stack variables to global environement simply:

(function(df){
    m   <- df[-1]
    IND <- rowSums(m) > 0
    m[] <- NA
    m[cbind(which(IND),max.col(df[-1])[IND])] <- 1
    cbind(df[1],m)
})(df)   # run this n-times

Upvotes: 0

paoloeusebi
paoloeusebi

Reputation: 1086

Here a solution replacing with NAs at random 3 years out 4 per subject

for (i in 1:dim(df)[1]){
    df[i,c(sample(2:5,3))]<-NA
    }

Upvotes: 0

Related Questions