ComputingVictor

Reputation: 100

How can I fill NA values with a random sample drawn within each category?

I have a data frame called 'listings_df'. I need to fill in the NA values of the variables 'number_of_reviews' and 'review_scores_rating' with random samples drawn from within each group of 'room_type'.

Here is a picture of what the data frame looks like:

listing_df.png

First, I tried grouping by 'room_type':

test <- listings_df %>% group_by(room_type)

Then I select the columns where I want to replace the NA values and create the samples:

test$number_of_reviews[is.na(listings_df$number_of_reviews)] <- 
  sample(listings_df$number_of_reviews, size = sum(is.na(listings_df$number_of_reviews)))

test$review_scores_rating[is.na(listings_df$review_scores_rating)] <- 
  sample(listings_df$review_scores_rating, size = sum(is.na(listings_df$review_scores_rating)))

I am not sure whether this creates the random data according to room_type. I would also like to know whether it is possible to do this with a loop.

Thanks!

Upvotes: 1

Views: 73

Answers (1)

r2evans

Reputation: 160687

What you're asking for is called imputation. I'll demonstrate using mtcars as the data, cyl as the grouping variable (your room_type, I suspect), and other columns with NA values.

mt <- mtcars
set.seed(42)
mt$disp[sample(32,10)] <- NA
mt$hp[sample(32,10)] <- NA
head(mt)
#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4         21.0   6   NA 110 3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
# Datsun 710        22.8   4  108  NA 3.85 2.320 18.61  1  1    4    1
# Hornet 4 Drive    21.4   6   NA  NA 3.08 3.215 19.44  1  0    3    1
# Hornet Sportabout 18.7   8   NA  NA 3.15 3.440 17.02  0  0    3    2
# Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

From here:

dplyr

library(dplyr)
set.seed(42)
mt %>%
  group_by(cyl) %>%
  mutate(across(c(disp, hp), ~ coalesce(., sample(na.omit(.), size=n(), replace=TRUE)))) %>%
  ungroup()
# # A tibble: 32 × 11
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#  3  22.8     4  108     91  3.85  2.32  18.6     1     1     4     1
#  4  21.4     6  145    110  3.08  3.22  19.4     1     0     3     1
#  5  18.7     8  301    180  3.15  3.44  17.0     0     0     3     2
#  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#  7  14.3     8  301    245  3.21  3.57  15.8     0     0     3     4
#  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#  9  22.8     4  141.    52  3.92  3.15  22.9     1     0     4     2
# 10  19.2     6  225    123  3.92  3.44  18.3     1     0     4     4
# # … with 22 more rows
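
Applied to your data, the same pattern might look like the sketch below. The toy listings_df here is made up for illustration: only the three column names come from your question, and the room-type labels and values are assumptions, so swap in your real data frame.

```r
library(dplyr)

# Hypothetical stand-in for listings_df; only the column names come from the question.
listings_df <- data.frame(
  room_type            = rep(c("Entire home/apt", "Private room"), each = 5),
  number_of_reviews    = c(10, NA, 3, 25, NA, 1, 2, NA, 4, 0),
  review_scores_rating = c(4.8, 4.5, NA, 4.9, NA, 3.9, NA, 4.1, 4.0, NA)
)

set.seed(42)
imputed <- listings_df %>%
  group_by(room_type) %>%
  # within each room_type, fill NAs with values sampled from that group's observed values
  mutate(across(
    c(number_of_reviews, review_scores_rating),
    ~ coalesce(.x, sample(na.omit(.x), size = n(), replace = TRUE))
  )) %>%
  ungroup()

imputed
```

Values that were already present are left untouched; only the NA positions receive a draw from the same room_type.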

data.table

library(data.table)
cols <- c("disp", "hp")
set.seed(42)
as.data.table(mt)[, c(cols) := lapply(.SD, \(z) fcoalesce(z, sample(na.omit(z), size=.N, replace=TRUE))), .SDcols = cols][] |>
  head()
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1:  21.0     6  79.0   110  3.90 2.620 16.46     0     1     4     4
# 2:  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
# 3:  22.8     4 108.0   113  3.85 2.320 18.61     1     1     4     1
# 4:  21.4     6 460.0    97  3.08 3.215 19.44     1     0     3     1
# 5:  18.7     8 146.7    62  3.15 3.440 17.02     0     0     3     2
# 6:  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
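
One caveat on the data.table call above: it has no by=, so it samples from the whole column rather than within each cyl group. That is visible in the output, where Hornet 4 Drive (6 cylinders) received a disp of 460, which is not the displacement of any 6-cylinder car in mtcars. A minimal sketch of a grouped variant, assuming the intent is the same per-group sampling as in the dplyr code, adds by = cyl (the name mt_dt is mine, and the data setup is repeated so the block runs on its own):

```r
library(data.table)

# Rebuild the example data so this block is self-contained
mt <- mtcars
set.seed(42)
mt$disp[sample(32, 10)] <- NA
mt$hp[sample(32, 10)] <- NA

cols <- c("disp", "hp")
set.seed(42)
mt_dt <- as.data.table(mt)
# by = cyl restricts both .SD and .N to one cylinder group at a time
mt_dt[, (cols) := lapply(.SD, \(z) fcoalesce(z, sample(na.omit(z), size = .N, replace = TRUE))),
      by = cyl, .SDcols = cols]
head(mt_dt)
```

I have not shown output here because the imputed values depend on the order in which data.table visits the groups.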

base R

set.seed(42)
# within each cyl group, replace NA with values sampled from that group's observed values
mt[cols] <- lapply(mt[cols], \(z) ave(z, mt$cyl, FUN = \(v)
  ifelse(is.na(v), sample(na.omit(v), size = length(v), replace = TRUE), v)))
head(mt)
#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
# Datsun 710        22.8   4  108  91 3.85 2.320 18.61  1  1    4    1
# Hornet 4 Drive    21.4   6  145 110 3.08 3.215 19.44  1  0    3    1
# Hornet Sportabout 18.7   8  301 180 3.15 3.440 17.02  0  0    3    2
# Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
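
Since you asked about a loop: the same idea can be written as explicit for loops in base R, though the vectorized versions above are usually shorter. This is only a sketch on the mtcars stand-in (mt2, to_fill and observed are names I made up); for your data you would loop over 'room_type' and your two columns instead.

```r
# Rebuild the example data with NAs so this block is self-contained
mt2 <- mtcars
set.seed(42)
mt2$disp[sample(32, 10)] <- NA
mt2$hp[sample(32, 10)] <- NA

set.seed(42)
for (col in c("disp", "hp")) {
  for (g in unique(mt2$cyl)) {
    in_group <- mt2$cyl == g
    to_fill  <- in_group & is.na(mt2[[col]])
    observed <- mt2[[col]][in_group & !is.na(mt2[[col]])]
    # skip groups with nothing to fill or nothing to sample from
    if (any(to_fill) && length(observed) > 0) {
      # Index via sample.int() so a single observed value is reused as-is
      # instead of being interpreted by sample() as the range 1:x.
      picks <- sample.int(length(observed), size = sum(to_fill), replace = TRUE)
      mt2[[col]][to_fill] <- observed[picks]
    }
  }
}
head(mt2)
```

A group whose values are all NA is left unchanged by this loop, because there is nothing within the group to draw from.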

Note: although I set the same random seed before the imputation in each dialect, the dialects do not necessarily process the groups and columns in the same order, so the values that replace the NAs differ between them. The seed is there for basic reproducibility within a dialect, not to produce identical results across dialects.
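
To illustrate what the seed does guarantee, here is a small base R check. impute_by_group is a hypothetical helper I wrote for this demonstration (it wraps the ave() approach from above); it is not part of any package.

```r
# Hypothetical helper: fill NAs in `cols` by sampling within each level of `group`
impute_by_group <- function(df, cols, group) {
  df[cols] <- lapply(df[cols], function(z) {
    ave(z, df[[group]], FUN = function(v) {
      ifelse(is.na(v), sample(na.omit(v), size = length(v), replace = TRUE), v)
    })
  })
  df
}

# Example data with NAs, built the same way as at the top of this answer
mt3 <- mtcars
set.seed(42)
mt3$disp[sample(32, 10)] <- NA
mt3$hp[sample(32, 10)] <- NA

set.seed(42)
run_a <- impute_by_group(mt3, c("disp", "hp"), "cyl")
set.seed(42)
run_b <- impute_by_group(mt3, c("disp", "hp"), "cyl")

identical(run_a, run_b)  # TRUE: same code and same seed give the same fill values
```

Changing the seed, or switching to another dialect with the same seed, generally gives different (but equally valid) fill values.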

Upvotes: 1
