Konrad

Reputation: 18585

dplyr'ish version of applying function to random rows

Background

I'm generating some sample data that I want to use to show some simple analytical operations on Spark. Spark context is not relevant here.

Sample data

The data I'm using looks as follows:

library("tidyverse")
set.seed(123)
dta_ts <-
  tibble(category = sample(LETTERS[1:4], replace = TRUE, size = 1e5)) %>% 
  group_by(category) %>%
  mutate(ref_dte = sample(
    x = seq(as.Date('2010-01-01'), as.Date('2016-12-30'), by = "1 day"),
    size = n(),
    replace = TRUE
  )) %>%
  ungroup() %>% 
  distinct() %>% 
  mutate(rand_val = rpois(n(), lambda = 10))

Question

I would like to insert some outliers into the data. In base R this is easy to achieve using:

# Add outliers
for (i in sample(1:nrow(dta_ts), 50)) {
  dta_ts[i,3] <- sample(1e4:1e6, 1)
}

Problem

The provided solution is, arguably, inefficient and inelegant. I would like to find a dplyr'ish way of achieving the same result. I'm aware of sample_n and sample_frac, but I'm not interested in sampling the data, only in modifying a random selection of rows. The ideal solution would function as a follow-up addition to the pipeline below:

... %>%
mutate(rand_val = rpois(n(), lambda = 10)) %>%
# Outliers are then created on random rows here

Upvotes: 1

Views: 156

Answers (2)

akrun

Reputation: 887118

We can also use case_when

library(dplyr)
n <- 50
dta_ts %>%
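     # The right-hand side must have length 1 or n(), so draw n() candidate values;
     # only those on the sampled rows are actually used
     mutate(rand_val = case_when(
         row_number() %in% sample(n(), n) ~ sample(1e4:1e6, n(), replace = TRUE),
         TRUE ~ rand_val))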

Or using base R

i1 <- sample(nrow(dta_ts), n)
dta_ts$rand_val[i1] <- sample(1e4:1e6, n)

Upvotes: 1

Ronak Shah

Reputation: 388982

You can sample n random row indices from 1 to the number of rows in the data and replace the values at those positions with n high values drawn from 1e4:1e6.

library(dplyr)
n <- 50

dta_ts %>%
    mutate(rand_val = replace(rand_val, sample(n(), n), sample(1e4:1e6, n)))

You could continue this in the same chain as your attempt; I am showing it separately here.

You can use the same logic in base R as well.

transform(dta_ts, 
    rand_val = replace(rand_val, sample(nrow(dta_ts), n), sample(1e4:1e6, n)))
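
If the step is needed in more than one pipeline, the same replace() logic could be wrapped in a small pipe-friendly function. The following is only a sketch: add_outliers() and its arguments (col, n_out, values) are hypothetical names introduced here, not part of dplyr, and it assumes a dplyr version that supports the {{ }} embrace operator. A small stand-in tibble is used instead of dta_ts to keep the example self-contained.

```r
library(dplyr)

# Hypothetical helper (not a dplyr function): overwrite `n_out` randomly
# chosen entries of column `col` with values drawn from `values`.
add_outliers <- function(data, col, n_out = 50, values = 1e4:1e6) {
  data %>%
    mutate({{ col }} := replace({{ col }},
                                sample(n(), n_out),        # random row positions
                                sample(values, n_out)))    # outlier values
}

set.seed(123)

# Small stand-in for dta_ts, built the same way as rand_val in the question
demo <- tibble(rand_val = rpois(1000, lambda = 10)) %>%
  add_outliers(rand_val, n_out = 50)

# Poisson(10) draws are far below 1e4, so this counts the injected rows
sum(demo$rand_val >= 1e4)
```

Because sample() draws row positions without replacement by default, exactly n_out distinct rows are overwritten, so the final count should equal n_out.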

Upvotes: 1
