Reputation: 529
Basically I have a vector names of all names, and a dataframe df with a BIN (0/1) field and a NAME field. For every row with BIN==0, I want to create a duplicate row but with 1 instead and add it to the bottom of df with a different name. Here's what I have to select a new name, given the current name:
sample(names[names!=name], 1)
But I'm not sure how to vectorize this and furthermore add it to df with the same data from BIN.
EDIT: Sample data:
df = data.frame(BIN=c(1,0,1), NAME=c("alice","bob","cate"))
names = c("alice","bob","cate","dan")
I got closer with something like this:
rbind(df, df %>% filter(BIN == 0) %>%
mutate(NAME = sample(names[names!=NAME],1)))
But I get an error: In binattr(e1, e2): length(e1) not a multiple of length(e2).
Upvotes: 1
Views: 762
Reputation: 146224
Here's a simple approach. I think it's pretty straightforward; let me know if you have questions:
rename = subset(df, BIN == 0)
rename$NEW_NAME = sample(names, size = nrow(rename), replace = TRUE)
while(any(rename$NAME == rename$NEW_NAME)) {
matches = rename$NAME == rename$NEW_NAME
rename$NEW_NAME[matches] = sample(names, size = sum(matches), replace = TRUE)
}
rename$BIN = 1
rename$NAME = rename$NEW_NAME
rename$NEW_NAME = NULL
result = rbind(df, rename)
result
# BIN NAME
# 1 1 alice
# 2 0 bob
# 3 1 cate
# 21 1 alice
Here's another approach, less clear but more efficient. This is the "right" way to do it, but it requires a little bit more thought and explanation.
df$NAME = factor(df$NAME, levels = names)
rename = subset(df, BIN == 0)
n = length(names)
# we will increment each level number with a random integer from
# 1 to n - 1 (with a mod to make it cyclical)
offset = sample(1:(n - 1), size = nrow(rename), replace = TRUE)
adjusted = (as.integer(rename$NAME) + offset) %% n
# reconcile 1-indexed factor levels with 0-indexed mod operator
adjusted[adjusted == 0] = n
rename$NAME = names[adjusted]
rename$BIN = 1
result = rbind(df, rename)
(or, rewritten for dplyr)
df = mutate(df, NAME = factor(NAME, levels = names))
n = length(names)
df %>% filter(BIN == 0) %>%
mutate(
offset = sample(1:(n - 1), size = n(), replace = TRUE),
adjusted = (as.integer(NAME) + offset) %% n,
adjusted = if_else(adjusted == 0, n, adjusted),
NAME = names[adjusted],
BIN = 1
) %>%
select(-offset, -adjusted) %>%
rbind(df, .)
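To see why the offset trick can never hand back the original name, here is a minimal sketch that enumerates every possible level/offset pair instead of sampling. The name grid is introduced here purely for illustration, and names is assumed to be the four-name vector from the question.

```r
# Assumed name pool, copied from the question's sample data
names <- c("alice", "bob", "cate", "dan")
n <- length(names)

# Every combination of starting level (1..n) and offset (1..n-1);
# `grid` is a hypothetical helper name used only for this check
grid <- expand.grid(level = 1:n, offset = 1:(n - 1))

# Same arithmetic as above: shift cyclically, then map 0 back to n
adjusted <- (grid$level + grid$offset) %% n
adjusted[adjusted == 0] <- n

all(adjusted %in% 1:n)        # TRUE: always a valid index into names
all(adjusted != grid$level)   # TRUE: never lands on the original level
```

Because the offset is between 1 and n - 1, it is never a multiple of n, so the shifted level always differs from the starting one.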
Since your issue is the vectorization part, I'd recommend testing any answer on a sample case with more than one BIN == 0 row. I used this:
df = data.frame(BIN=c(1,0,1,0,0,0,0,0,0), NAME=rep(c("alice","bob","cate"), 3))
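A check along these lines could be run against the result of any answer. This is only a sketch: check_result is a hypothetical helper, and it assumes the result keeps the original rows first, followed by the new rows in the same order as the BIN == 0 rows. The base-R offset code in the middle is just a stand-in for whichever approach is being tested.

```r
df <- data.frame(BIN = c(1, 0, 1, 0, 0, 0, 0, 0, 0),
                 NAME = rep(c("alice", "bob", "cate"), 3),
                 stringsAsFactors = FALSE)
names <- c("alice", "bob", "cate", "dan")

# Hypothetical helper: TRUE if every BIN == 0 row got exactly one
# appended copy with BIN == 1 and a different NAME
check_result <- function(df, result) {
  zeros <- df[df$BIN == 0, ]
  added <- result[-seq_len(nrow(df)), ]
  nrow(added) == nrow(zeros) &&
    all(added$BIN == 1) &&
    all(as.character(added$NAME) != as.character(zeros$NAME))
}

# Stand-in for the approach under test (cyclic offset, base R only)
zeros <- subset(df, BIN == 0)
n <- length(names)
idx <- (match(zeros$NAME, names) +
          sample(1:(n - 1), nrow(zeros), replace = TRUE)) %% n
idx[idx == 0] <- n
zeros$NAME <- names[idx]
zeros$BIN <- 1
result <- rbind(df, zeros)

check_result(df, result)
```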
And, because I was curious, here's a benchmark for 10k rows with 26 names. Results first, code below:
# Unit: milliseconds
# expr min lq mean median uq max neval
# while_loop 34.070438 34.327020 37.53357 35.548047 39.922918 46.206454 10
# increment 1.397617 1.458592 1.88796 1.526512 2.123894 3.196104 10
# increment_dplyr 24.002169 24.681960 25.50568 25.374429 25.750548 28.054954 10
# map_char 346.531498 347.732905 361.82468 359.736403 374.648635 383.575265 10
The "clever" way is by far the fastest. My guess is the dplyr slowdown is because we can't do the direct replacement of only the relevant bits of adjusted, and instead have to add the overhead of if_else. That and we are actually adding columns to the data frame for adjusted and offset rather than dealing with vectors. This is enough to make it almost as slow as the while loop approach, which is still 10x faster than the map_chr approach, which has to go one row at a time.
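library(tidyverse)  # for %>%, filter, mutate, select, if_else, map_chr
nn = 10000
names = letters  # the 26-name pool used in this benchmark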
df = data.frame(
BIN = sample(0:1, size = nn, replace = TRUE, prob = c(0.7, 0.3)),
NAME = factor(sample(letters, size = nn, replace = TRUE), levels = letters)
)
get.new.name <- function(c){
return(sample(names[names!=c],1))
}
microbenchmark::microbenchmark(
while_loop = {
rename = subset(df, BIN == 0)
rename$NEW_NAME = sample(names, size = nrow(rename), replace = TRUE)
while (any(rename$NAME == rename$NEW_NAME)) {
matches = rename$NAME == rename$NEW_NAME
rename$NEW_NAME[matches] = sample(names, size = sum(matches), replace = TRUE)
}
rename$BIN = 1
rename$NAME = rename$NEW_NAME
rename$NEW_NAME = NULL
result = rbind(df, rename)
},
increment = {
rename = subset(df, BIN == 0)
n = length(names)
# we will increment each level number with a random integer from
# 1 to n - 1 (with a mod to make it cyclical)
offset = sample(1:(n - 1), size = nrow(rename), replace = TRUE)
adjusted = (as.integer(rename$NAME) + offset) %% n
# reconcile 1-indexed factor levels with 0-indexed mod operator
adjusted[adjusted == 0] = n
rename$NAME = names[adjusted]
rename$BIN = 1
},
increment_dplyr = {
n = length(names)
df %>% filter(BIN == 0) %>%
mutate(
offset = sample(1:(n - 1), size = n(), replace = TRUE),
adjusted = (as.integer(NAME) + offset) %% n,
adjusted = if_else(adjusted == 0, n, adjusted),
NAME = names[adjusted],
BIN = 1
) %>%
select(-offset,-adjusted)
},
map_char = {
new.df <- df %>% filter(BIN == 0) %>%
mutate(NAME = map_chr(NAME, get.new.name)) %>%
mutate(BIN = 1)
},
times = 10
)
Upvotes: 2
Reputation: 529
Well, I didn't intend to answer my own question, but I did find a simpler solution. I think it's better than using rowwise(), but I don't know if it's necessarily the most efficient way.
library(tidyverse)
get.new.name <- function(c){
return(sample(names[names!=c],1))
}
new.df <- rbind(df, df %>% filter(BIN == 0) %>%
mutate(NAME = map_chr(NAME, get.new.name)) %>%
mutate(BIN = 1))
map_chr ended up being pretty important instead of just map, since the latter would return a weird list of lists.
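A small sketch of the difference, assuming purrr is installed; the input vector current is made up for illustration, and names is the four-name vector from the question.

```r
library(purrr)

names <- c("alice", "bob", "cate", "dan")
get.new.name <- function(c) sample(names[names != c], 1)

current <- c("bob", "cate")  # hypothetical input names

as_list <- map(current, get.new.name)      # one list element per input
as_chr  <- map_chr(current, get.new.name)  # plain character vector

is.list(as_list)        # TRUE
is.character(as_chr)    # TRUE
all(as_chr != current)  # TRUE: the current name is excluded from the pool
```

A character vector is what mutate() needs for an ordinary NAME column; with map, the column would end up as a list-column instead.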
Upvotes: 1
Reputation: 3830
A little weird but I think this should be what you want:
library(tidyverse)
df <- data.frame(BIN=c(1,0,1), NAME=c("alice","bob","cate"), stringsAsFactors = FALSE)
names <- c("alice","bob","cate","dan")
df %>%
mutate(NAME_new = ifelse(BIN == 0, sample(names, n(), replace = TRUE), NA)) %>%
gather(name_type, NAME, NAME:NAME_new, na.rm = TRUE) %>%
mutate(BIN = ifelse(name_type == "NAME_new", 1, BIN)) %>%
select(-name_type)
Output:
BIN NAME
1 1 alice
2 0 bob
3 1 cate
4 1 alice
Upvotes: -1