Tabea

Reputation: 67

Randomize per group

I am trying to randomize a numerical vector in a data frame with R. My data looks something like this:

user click  
1025     0        
1025     1        
1025     0        
1025     0        
1025     0        
1025     0        
1025     1        
1025     0        
1025     0        
1025     0        
1025     0        
14639    1        
14639    0  
14639    0
14639    1      
11605    0        
11605    0        
14605    1        

In the data, some users appear more often than others. I now want to replace each user ID with a new one. Let's say there are 100 unique user IDs: at the end, I still want to have 100 different unique user IDs.

I tried dplyr:

data %>% group_by(user) %>% mutate(anon = rep(sample(length(unique(data$user)), 1, replace = F), n()))

However, that doesn't work because the sampling is done separately for each user, ignoring the other users. As a result, some users end up with the same new user ID.

Can someone tell me how I can - at random - create a new user ID (that does not repeat) for each person in the data frame?

Upvotes: 1

Views: 69

Answers (2)

Ronak Shah

Reputation: 389275

You can sample the numbers from 1 to n without replacement, where n is the number of unique users. This gives a random permutation, so you can then assign one distinct random ID to each user.

set.seed(1234)
inds <- sample(length(unique(df$user)))
df$new_user <- inds[match(df$user, unique(df$user))]
df

#    user click new_user
#1   1025     0        4
#2   1025     1        4
#3   1025     0        4
#4   1025     0        4
#5   1025     0        4
#6   1025     0        4
#7   1025     1        4
#8   1025     0        4
#9   1025     0        4
#10  1025     0        4
#11  1025     0        4
#12 14639     1        2
#13 14639     0        2
#14 14639     0        2
#15 14639     1        2
#16 11605     0        3
#17 11605     0        3
#18 14605     1        1
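
The same idea can also be written as a single `match()` call against the shuffled unique IDs, and it may be worth checking afterwards that the mapping really is one-to-one. A minimal sketch, assuming `df` is a plain data frame rebuilt here from the sample data in the question (the `lookup` object is only introduced for the check):

```r
# Rebuild the sample data from the question (assumed to be a plain data frame)
df <- data.frame(
  user  = c(rep(1025, 11), rep(14639, 4), rep(11605, 2), 14605),
  click = c(0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1)
)

set.seed(1234)

# Shuffle the unique IDs; a user's position in the shuffled vector is its new ID
df$new_user <- match(df$user, sample(unique(df$user)))

# Sanity checks: each old ID maps to exactly one new ID, and no new ID is shared
lookup <- unique(df[, c("user", "new_user")])
stopifnot(
  nrow(lookup) == length(unique(df$user)),
  anyDuplicated(lookup$new_user) == 0
)
```

Because the shuffle happens once over all unique users rather than once per group, two users cannot receive the same new ID.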

Upvotes: 0

shizundeiku

Reputation: 320

I would solve this by first generating some user IDs, then creating a temporary tibble that associates existing with new user IDs, then joining your previous data with this table:

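library(dplyr)

# Randomly generate one new ID per unique user
# (sample() on a vector returns a random permutation of it)
new_user_ids <- sample(seq_along(unique(df$user)))

# Join the lookup table onto the data, then overwrite the old IDs
df %>%
  left_join(tibble(user = unique(df$user), new.user = new_user_ids), by = "user") %>%
  mutate(user = new.user) %>% select(-new.user)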

This gives the following result, for example:

    user click
   <int> <dbl>
 1     3     0
 2     3     1
 3     3     0
 4     3     0
 5     3     0
 6     3     0
 7     3     1
 8     3     0
 9     3     0
10     3     0
11     3     0
12     2     1
13     2     0
14     2     0
15     2     1
16     4     0
17     4     0
18     1     1
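
If the original IDs might need to be recovered later, it may help to keep the lookup table as a named object instead of building it inline. A rough sketch of that variant, assuming dplyr is installed and using the sample data from the question (`key` and `anonymised` are names chosen here for illustration):

```r
library(dplyr)

# Sample data from the question, as a tibble
df <- tibble(
  user  = c(rep(1025, 11), rep(14639, 4), rep(11605, 2), 14605),
  click = c(0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1)
)

set.seed(42)

# One row per original user, paired with a shuffled replacement ID
key <- tibble(user = unique(df$user)) %>%
  mutate(new_user = sample(n()))

# Attach the new IDs, then overwrite the old ones
anonymised <- df %>%
  left_join(key, by = "user") %>%
  mutate(user = new_user) %>%
  select(-new_user)

# Every original user received a distinct new ID
stopifnot(n_distinct(anonymised$user) == nrow(key))
```

Storing `key` separately from the anonymised data lets you reverse the mapping when needed, while the shared data set only ever contains the new IDs.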

Upvotes: 1
