mp3242

Reputation: 529

How to add new rows with different value in one column in R

I have a vector, names, containing all possible names, and a data frame df with a BIN (0/1) field and a NAME field. For every row with BIN == 0, I want to create a duplicate row with BIN set to 1 and a different name, and append it to the bottom of df. Here's what I have to select a new name, given the current name:

sample(names[names!=name], 1)

But I'm not sure how to vectorize this, and furthermore how to add the result to df along with the data from the BIN == 0 rows.

EDIT: Sample data:

df = data.frame(BIN=c(1,0,1), NAME=c("alice","bob","cate"))
names = c("alice","bob","cate","dan")

I got closer with something like this:

rbind(df, df %>% filter(BIN == 0) %>%
    mutate(NAME = sample(names[names!=NAME],1)))

But I get an error: In binattr(e1, e2): length(e1) not a multiple of length(e2).

Upvotes: 1

Views: 762

Answers (3)

Gregor Thomas

Reputation: 146224

Here's a simple approach. I think it's pretty straightforward; let me know if you have questions:

rename = subset(df, BIN == 0)
rename$NEW_NAME = sample(names, size = nrow(rename), replace = TRUE)
while(any(rename$NAME == rename$NEW_NAME)) {
  matches = rename$NAME == rename$NEW_NAME
  rename$NEW_NAME[matches] = sample(names, size = sum(matches), replace = TRUE)
}
rename$BIN = 1
rename$NAME = rename$NEW_NAME
rename$NEW_NAME = NULL

result = rbind(df, rename)
result
#    BIN  NAME
# 1    1 alice
# 2    0   bob
# 3    1  cate
# 21   1 alice

Here's another approach, less clear but more efficient. This is the "right" way to do it, but it requires a little bit more thought and explanation.

df$NAME = factor(df$NAME, levels = names)
rename = subset(df, BIN == 0)
n = length(names)
# we will increment each level number with a random integer from
# 1 to n - 1 (with a mod to make it cyclical)
offset = sample(1:(n - 1), size = nrow(rename), replace = TRUE)
adjusted = (as.integer(rename$NAME) + offset) %% n
# reconcile 1-indexed factor levels with 0-indexed mod operator
adjusted[adjusted == 0] = n
rename$NAME = names[adjusted]
rename$BIN = 1
result = rbind(df, rename)

(or, rewritten for dplyr)

df = mutate(df, NAME = factor(NAME, levels = names))
n = length(names)
df %>% filter(BIN == 0) %>%
  mutate(
    offset = sample(1:(n - 1), size = n(), replace = TRUE),
    adjusted = (as.integer(NAME) + offset) %% n,
    adjusted = if_else(adjusted == 0, n, adjusted),
    NAME = names[adjusted],
    BIN = 1
  ) %>%
  select(-offset, -adjusted) %>%
  rbind(df, .)
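As a sanity check on the offset idea, here is a minimal sketch showing that a random shift between 1 and n - 1, wrapped around with a mod, never lands back on the original name. It assumes the four-element names vector from the question. The variable names old and new, the seed, and the sample size of 1000 are made up for illustration. It also uses a slightly different but equivalent way of handling the 1-indexing (subtract 1 before the mod, add 1 after) instead of patching the zeros afterwards.

```r
# Illustrative check, not part of the approaches above.
set.seed(42)  # arbitrary seed, only so the run is reproducible
names <- c("alice", "bob", "cate", "dan")
n <- length(names)

# 1000 made-up "current" names, stored as a factor with levels in `names` order
old <- factor(sample(names, 1000, replace = TRUE), levels = names)

# A shift of 1 to n - 1 positions can never be a multiple of n,
# so the wrapped position always differs from the starting one.
offset <- sample(1:(n - 1), size = length(old), replace = TRUE)

# Move to 0-based positions, add the offset, wrap around, move back to 1-based
adjusted <- (as.integer(old) - 1 + offset) %% n + 1
new <- names[adjusted]

any(new == as.character(old))  # FALSE
```

Because every other name is reachable by exactly one offset value, each alternative is picked with equal probability, which matches what sample(names[names != name], 1) does for a single name.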

Since your issue is the vectorization part, I'd recommend testing any answer on a sample case with more than one BIN 0 row. I used this:

df = data.frame(BIN=c(1,0,1,0,0,0,0,0,0), NAME=rep(c("alice","bob","cate"), 3))
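A minimal sketch of what such a test could look like, assuming the while-loop approach from the top of this answer and the four-element names vector from the question. The old_names variable and the stopifnot checks are illustrative additions, and stringsAsFactors = FALSE is set so the comparison works on plain character vectors regardless of R version.

```r
names <- c("alice", "bob", "cate", "dan")
df <- data.frame(BIN = c(1, 0, 1, 0, 0, 0, 0, 0, 0),
                 NAME = rep(c("alice", "bob", "cate"), 3),
                 stringsAsFactors = FALSE)

# Rows to duplicate, remembering the names they started with
rename <- subset(df, BIN == 0)
old_names <- rename$NAME

# Draw new names, then redraw only the ones that collide with the old name
rename$NAME <- sample(names, size = nrow(rename), replace = TRUE)
while (any(rename$NAME == old_names)) {
  matches <- rename$NAME == old_names
  rename$NAME[matches] <- sample(names, size = sum(matches), replace = TRUE)
}
rename$BIN <- 1
result <- rbind(df, rename)

# Checks: one new row per BIN == 0 row, every new name differs from the
# old one, and all appended rows have BIN == 1
stopifnot(
  nrow(result) == nrow(df) + sum(df$BIN == 0),
  all(rename$NAME != old_names),
  all(result$BIN[-seq_len(nrow(df))] == 1)
)
```

If any check fails, stopifnot raises an error naming the failed condition, which makes it easy to spot an answer that only works for a single BIN 0 row.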

And, because I was curious, here's a benchmark for 10k rows with 26 names. Results first, code below:

# Unit: milliseconds
#             expr        min         lq      mean     median         uq        max neval
#       while_loop  34.070438  34.327020  37.53357  35.548047  39.922918  46.206454    10
#        increment   1.397617   1.458592   1.88796   1.526512   2.123894   3.196104    10
#  increment_dplyr  24.002169  24.681960  25.50568  25.374429  25.750548  28.054954    10
#         map_char 346.531498 347.732905 361.82468 359.736403 374.648635 383.575265    10

The "clever" increment approach is by far the fastest. My guess is that the dplyr slowdown comes from not being able to directly replace only the relevant elements of adjusted, so we pay the overhead of if_else instead. On top of that, we are actually adding columns to the data frame for adjusted and offset rather than working with plain vectors. This is enough to make it almost as slow as the while loop approach, which is still 10x faster than the map_chr approach, which has to go one row at a time.

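library(dplyr)  # %>%, filter, mutate, if_else, select
library(purrr)  # map_chr

# 26 names, matching the factor levels used for NAME below
names = letters
nn = 10000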
df = data.frame(
  BIN = sample(0:1, size = nn, replace = TRUE, prob = c(0.7, 0.3)),
  NAME = factor(sample(letters, size = nn, replace = TRUE), levels = letters)
)

get.new.name <- function(c){
    return(sample(names[names!=c],1))
}

microbenchmark::microbenchmark(
  while_loop = {
    rename = subset(df, BIN == 0)
    rename$NEW_NAME = sample(names, size = nrow(rename), replace = TRUE)
    while (any(rename$NAME == rename$NEW_NAME)) {
      matches = rename$NAME == rename$NEW_NAME
      rename$NEW_NAME[matches] = sample(names, size = sum(matches), replace = TRUE)
    }
    rename$BIN = 1
    rename$NAME = rename$NEW_NAME
    rename$NEW_NAME = NULL
    result = rbind(df, rename)
  }, 
  increment = {
    rename = subset(df, BIN == 0)
    n = length(names)
    # we will increment each level number with a random integer from
    # 1 to n - 1 (with a mod to make it cyclical)
    offset = sample(1:(n - 1), size = nrow(rename), replace = TRUE)
    adjusted = (as.integer(rename$NAME) + offset) %% n
    # reconcile 1-indexed factor levels with 0-indexed mod operator
    adjusted[adjusted == 0] = n
    rename$NAME = names[adjusted]
    rename$BIN = 1
  },
  increment_dplyr = {
    n = length(names)
    df %>% filter(BIN == 0) %>%
      mutate(
        offset = sample(1:(n - 1), size = n(), replace = TRUE),
        adjusted = (as.integer(NAME) + offset) %% n,
        adjusted = if_else(adjusted == 0, n, adjusted),
        NAME = names[adjusted],
        BIN = 1
      ) %>%
      select(-offset,-adjusted)
  },
  map_char = {
    new.df <- df %>% filter(BIN == 0) %>%
      mutate(NAME = map_chr(NAME, get.new.name)) %>%
      mutate(BIN = 1)
  },
  times = 10
)

Upvotes: 2

mp3242

Reputation: 529

Well I didn't intend to answer my own question but I did find a simpler solution. I think it's better than using rowwise() but I don't know if it's necessarily the most efficient way.

library(tidyverse)

get.new.name <- function(c){
    return(sample(names[names!=c],1))
}

new.df <- rbind(df, df %>% filter(BIN == 0) %>%
    mutate(NAME = map_chr(NAME, get.new.name)) %>%
    mutate(BIN = 1))

map_chr ended up being pretty important here, rather than plain map, since the latter returns a list (so NAME would become a list column) instead of a character vector.
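To illustrate the difference, here is a small sketch comparing the two on a plain character vector. It assumes purrr is installed and reuses get.new.name from above. The current vector and the as_list / as_chr names are made up for the example.

```r
library(purrr)

names <- c("alice", "bob", "cate", "dan")
get.new.name <- function(c) {
  # pick one name at random from everything except the current one
  sample(names[names != c], 1)
}

current <- c("bob", "cate")  # hypothetical current names

as_list <- map(current, get.new.name)      # a list with one element per input
as_chr  <- map_chr(current, get.new.name)  # a plain character vector

is.list(as_list)      # TRUE
is.character(as_chr)  # TRUE
```

Inside mutate, the character vector from map_chr slots straight into the NAME column, whereas the list from map would need to be unlisted first.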

Upvotes: 1

Jack Brookes

Reputation: 3830

A little weird but I think this should be what you want:

library(tidyverse)

df <- data.frame(BIN=c(1,0,1), NAME=c("alice","bob","cate"), stringsAsFactors = FALSE)
names <- c("alice","bob","cate","dan")

df %>% 
  mutate(NAME_new = ifelse(BIN == 0, sample(names, n(), replace = TRUE), NA)) %>% 
  gather(name_type, NAME, NAME:NAME_new, na.rm = TRUE) %>% 
  mutate(BIN = ifelse(name_type == "NAME_new", 1, BIN)) %>% 
  select(-name_type)

Output:

  BIN  NAME
1   1 alice
2   0   bob
3   1  cate
4   1 alice

Upvotes: -1
