Reputation: 18585
I'm generating some sample data that I want to use to demonstrate some simple analytical operations in Spark. The Spark context is not relevant here.
The data I'm using looks as follows:
library("tidy verse")
set.seed(123)
dta_ts <-
  tibble(category = sample(LETTERS[1:4], replace = TRUE, size = 1e5)) %>%
  group_by(category) %>%
  # Draw a random date for every row within each category
  mutate(ref_dte = sample(
    x = seq(as.Date('2010-01-01'), as.Date('2016-12-30'), by = "1 day"),
    size = n(),
    replace = TRUE
  )) %>%
  ungroup() %>%
  distinct() %>%
  # Attach a Poisson-distributed value to each remaining row
  mutate(rand_val = rpois(n(), lambda = 10))
I would like to insert some outliers to the data. In base R this is easy to achieve using:
# Add outliers
for (i in sample(1:nrow(dta_ts), 50)) {
  dta_ts[i, 3] <- sample(1e4:1e6, 1)
}
The provided solution is, arguably, inefficient and inelegant. I would like to find a dplyr'ish way of achieving the same result. I'm aware of sample_n and sample_frac, but I'm not interested in sampling the data, only in accessing a random selection of rows. The ideal solution would function as a follow-up addition to the pipeline below:
... %>%
  mutate(rand_val = rpois(n(), lambda = 10)) %>%
  # Outliers are created at random rows here
Upvotes: 1
Views: 156
Reputation: 887118
We can also use case_when
library(dplyr)
n <- 50
dta_ts %>%
  mutate(rand_val = case_when(
    row_number() %in% sample(n(), n) ~ sample(1e4:1e6, n()),
    TRUE ~ rand_val
  ))
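Note that case_when requires each right-hand side to be length 1 or n(), which is why sample(1e4:1e6, n()) draws a full column of candidate values even though only n of them are kept. As a quick sanity check (a minimal sketch; the 1e4 threshold is simply the bottom of the injected range), you can inspect the injected rows:
out <- dta_ts %>%
  mutate(rand_val = case_when(
    row_number() %in% sample(n(), n) ~ sample(1e4:1e6, n()),
    TRUE ~ rand_val
  ))
# rpois(lambda = 10) values stay small, so anything >= 1e4 was injected
out %>% filter(rand_val >= 1e4)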
Or using base R
i1 <- sample(nrow(dta_ts), n)
dta_ts$rand_val[i1] <- sample(1e4:1e6, n)
Upvotes: 1
Reputation: 388982
You can generate n random positions from 1 to the number of rows in the data and replace the values at those positions with n high values from 1e4:1e6.
library(dplyr)
n <- 50
dta_ts %>%
  mutate(rand_val = replace(rand_val, sample(n(), n), sample(1e4:1e6, n)))
You could continue this in the same chain as in your attempt; I am showing it separately here.
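For completeness, here is a sketch of the replace() step appended directly to the question's pipeline (same generation code as in the question, with n defined beforehand):
library(tidyverse)
set.seed(123)
n <- 50
dta_ts <-
  tibble(category = sample(LETTERS[1:4], replace = TRUE, size = 1e5)) %>%
  group_by(category) %>%
  mutate(ref_dte = sample(
    x = seq(as.Date('2010-01-01'), as.Date('2016-12-30'), by = "1 day"),
    size = n(),
    replace = TRUE
  )) %>%
  ungroup() %>%
  distinct() %>%
  mutate(rand_val = rpois(n(), lambda = 10)) %>%
  # Outlier injection as a follow-up step in the same chain
  mutate(rand_val = replace(rand_val, sample(n(), n), sample(1e4:1e6, n)))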
You can use the same logic in base R as well.
transform(dta_ts,
          rand_val = replace(rand_val, sample(nrow(dta_ts), n), sample(1e4:1e6, n)))
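A quick check on the base R version (a sketch; note that transform() returns a modified copy rather than changing dta_ts in place):
res <- transform(dta_ts,
                 rand_val = replace(rand_val, sample(nrow(dta_ts), n), sample(1e4:1e6, n)))
# Exactly n values should fall in the injected range
sum(res$rand_val >= 1e4)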
Upvotes: 1