Reputation: 420
I have data that looks like this.
investor_name funding_round_type count
<chr> <chr> <int>
1 .406 Ventures angel 1
2 .406 Ventures other 2
3 .406 Ventures private-equity 1
4 .406 Ventures series-a 5
5 .406 Ventures series-b 2
6 .406 Ventures series-c+ 7
7 .406 Ventures venture 1
8 500 Startups angel 40
I would like to replace every instance where funding_round_type is equal to venture with either series-a, series-b or series-c+. I'd like to randomly select one of those, with a 40% chance for each of the first two and a 20% chance for the last one.
my_df %>%
mutate(funding_round_type = ifelse(funding_round_type == "venture",
sample(c("series-a", "series-b", "series-c"), 1, replace = TRUE, prob = c(.4, .4, .2)),
funding_round_type))
Weirdly, sample() seems to choose once and then reuse the chosen value for every row. I've run this a few times, and each time it replaces venture with only one of the values from my list of options; none of the other values ever appear.
How can I get sample() to run fresh on every row?
Upvotes: 2
Views: 48
Reputation: 887311
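We can use data.table methods. The sample size has to be .N (the number of matching rows) with replace = TRUE; a size of 1 would draw once and recycle that single value to every matching row.
library(data.table)
# .N is the number of rows where funding_round_type == "venture", so each row gets its own draw
setDT(df)[funding_round_type == "venture", funding_round_type :=
    sample(c("series-a", "series-b", "series-c+"), .N, replace = TRUE, prob = c(.4, .4, .2))][]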
# investor_name funding_round_type count
#1: .406 Ventures angel 1
#2: .406 Ventures other 2
#3: .406 Ventures private-equity 1
#4: .406 Ventures series-a 5
#5: .406 Ventures series-b 2
#6: .406 Ventures series-c+ 7
#7: .406 Ventures series-b 1
#8: 500 Startups angel 40
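Or using case_when from the tidyverse, drawing n() values so that every row has its own candidate replacement
library(tidyverse)
df %>%
  mutate(funding_round_type = case_when(funding_round_type == "venture" ~
      # n() draws (one per row); only those at "venture" rows are used
      sample(c("series-a", "series-b", "series-c+"), n(), replace = TRUE, prob = c(.4, .4, .2)),
    TRUE ~ funding_round_type))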
# investor_name funding_round_type count
#1 .406 Ventures angel 1
#2 .406 Ventures other 2
#3 .406 Ventures private-equity 1
#4 .406 Ventures series-a 5
#5 .406 Ventures series-b 2
#6 .406 Ventures series-c+ 7
#7 .406 Ventures series-a 1
#8 500 Startups angel 40
df <- structure(list(investor_name = c(".406 Ventures", ".406 Ventures",
".406 Ventures", ".406 Ventures", ".406 Ventures", ".406 Ventures",
".406 Ventures", "500 Startups"), funding_round_type = c("angel",
"other", "private-equity", "series-a", "series-b", "series-c+",
"venture", "angel"), count = c(1L, 2L, 1L, 5L, 2L, 7L, 1L, 40L
)), class = "data.frame", row.names = c("1", "2", "3", "4", "5",
"6", "7", "8"))
Upvotes: 0
Reputation: 389055
This happens because sample(..., 1) is evaluated only once, and ifelse recycles that single value to every row where the condition is TRUE. Draw one value per row instead, so that each venture row picks up its own draw. Try:
library(dplyr)
my_df %>%
  mutate(funding_round_type = ifelse(funding_round_type == "venture",
                                     # n() draws, one per row; ifelse keeps those at "venture" rows
                                     sample(c("series-a", "series-b", "series-c+"),
                                            n(), replace = TRUE, prob = c(.4, .4, .2)),
                                     funding_round_type))
Or with replace(), which only needs as many draws as there are venture rows:
my_df %>%
  mutate(funding_round_type = replace(funding_round_type,
      funding_round_type == "venture", sample(c("series-a", "series-b", "series-c+"),
      sum(funding_round_type == "venture"), replace = TRUE, prob = c(.4, .4, .2))))
You can also replace the values directly, without ifelse or any packages:
my_df$funding_round_type[my_df$funding_round_type == "venture"] <-
  with(my_df, sample(c("series-a", "series-b", "series-c+"),
       sum(funding_round_type == "venture"), replace = TRUE, prob = c(.4, .4, .2)))
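To see the recycling described above in isolation, here is a minimal base-R sketch. The vector x, the seed, and the names one_draw and many_draws are made up for illustration and are not part of the question's data.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
round_types <- c("series-a", "series-b", "series-c+")
x <- rep("venture", 1000)  # hypothetical column made up entirely of "venture"

# A single draw: ifelse() recycles that one value to every matching row.
one_draw <- ifelse(x == "venture",
                   sample(round_types, 1, prob = c(.4, .4, .2)),
                   x)
length(unique(one_draw))  # 1

# One draw per matching row: request as many values as there are matches.
is_venture <- x == "venture"
many_draws <- x
many_draws[is_venture] <- sample(round_types, sum(is_venture),
                                 replace = TRUE, prob = c(.4, .4, .2))
prop.table(table(many_draws))  # shares should land near 0.4 / 0.4 / 0.2
```

With enough venture rows, the observed shares give a quick sanity check that the probabilities are being applied per row.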
Upvotes: 2
Reputation: 21274
Using rowwise() will resample for each row:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(funding_round_type = if_else(
    funding_round_type == "venture",
    sample(c("series-a", "series-b", "series-c+"), 1, prob = c(.4, .4, .2)),
    funding_round_type)) %>%
  ungroup()  # drop the row-wise grouping so later steps behave normally
Also, a minor point: you don't need replace=TRUE, since you're only pulling one sample per call to sample().
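As a rough illustration of when replace matters, the base-R sketch below contrasts a single draw with a request for more values than there are options. The names res and five are hypothetical; tryCatch is used only so the script keeps running after the expected error.

```r
round_types <- c("series-a", "series-b", "series-c+")

# Size 1: replace makes no difference.
sample(round_types, 1, prob = c(.4, .4, .2))

# Size 5 from 3 options without replacement fails; capture the error message.
res <- tryCatch(
  sample(round_types, 5, prob = c(.4, .4, .2)),
  error = function(e) conditionMessage(e)
)
res

# With replace = TRUE the same request succeeds.
five <- sample(round_types, 5, replace = TRUE, prob = c(.4, .4, .2))
five
```

So replace = TRUE is optional in the row-wise version, but it is needed in any approach that draws all replacement values in one call.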
Upvotes: 0