Reputation: 100
I have a DF called 'listings_df'. I need to fill in the NA values of the variables 'number_of_reviews' and 'review_scores_rating' by drawing random samples from the values within each 'room_type' group.
I attach a picture of what the DF looks like:
First of all, I tried grouping by 'room_type':
test <- listings_df %>% group_by(room_type)
Then I select the columns where I want to replace the NA values and create the samples:
test$number_of_reviews[is.na(listings_df$number_of_reviews)] <-
sample(listings_df$number_of_reviews, size = sum(is.na(listings_df$number_of_reviews)))
test$review_scores_rating[is.na(listings_df$review_scores_rating)] <-
sample(listings_df$review_scores_rating, size = sum(is.na(listings_df$review_scores_rating)))
I am not sure whether this creates the random data according to the room_type. I would also like to know whether it's possible to do this with a loop.
Thanks!
Upvotes: 1
Views: 73
Reputation: 160687
What you're asking for is called imputation. I'll demonstrate using mtcars as the data, cyl as the grouping variable (your room_type, I suspect), and other columns with NA values.
mt <- mtcars
set.seed(42)
mt$disp[sample(32,10)] <- NA
mt$hp[sample(32,10)] <- NA
head(mt)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21.0 6 NA 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# Datsun 710 22.8 4 108 NA 3.85 2.320 18.61 1 1 4 1
# Hornet 4 Drive 21.4 6 NA NA 3.08 3.215 19.44 1 0 3 1
# Hornet Sportabout 18.7 8 NA NA 3.15 3.440 17.02 0 0 3 2
# Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
From here, with dplyr:
library(dplyr)
set.seed(42)
mt %>%
group_by(cyl) %>%
mutate(across(c(disp, hp), ~ coalesce(., sample(na.omit(.), size=n(), replace=TRUE)))) %>%
ungroup()
# # A tibble: 32 × 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
# 3 22.8 4 108 91 3.85 2.32 18.6 1 1 4 1
# 4 21.4 6 145 110 3.08 3.22 19.4 1 0 3 1
# 5 18.7 8 301 180 3.15 3.44 17.0 0 0 3 2
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
# 7 14.3 8 301 245 3.21 3.57 15.8 0 0 3 4
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
# 9 22.8 4 141. 52 3.92 3.15 22.9 1 0 4 2
# 10 19.2 6 225 123 3.92 3.44 18.3 1 0 4 4
# # … with 22 more rows
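One caveat with sample(na.omit(.), ...): if a group has exactly one non-missing value and that value is a number of at least 1, sample() treats it as the upper bound of 1:x instead of as the only candidate. A minimal sketch of a guard follows; sample_obs is a hypothetical helper name, not part of base R or dplyr.

```r
# sample_obs() is a hypothetical helper, not part of base R or dplyr.
# It draws `size` replacement values from the non-missing entries of x.
sample_obs <- function(x, size) {
  obs <- x[!is.na(x)]
  if (length(obs) == 0L) {
    # Nothing to draw from: return NAs of x's own type instead of erroring.
    return(x[rep(NA_integer_, size)])
  }
  # Index into obs so a lone value such as 160 is never read as 1:160.
  obs[sample.int(length(obs), size = size, replace = TRUE)]
}

set.seed(42)
sample_obs(c(NA, 160, NA), size = 3)
# [1] 160 160 160
```

In the pipeline above, this could stand in for the inner call, for example coalesce(., sample_obs(., n())).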
With data.table:
library(data.table)
cols <- c("disp", "hp")
set.seed(42)
# by = cyl keeps each draw inside the row's own cyl group
as.data.table(mt)[, c(cols) := lapply(.SD, \(z) fcoalesce(z, sample(na.omit(z), size=.N, replace=TRUE))), .SDcols = cols, by = cyl][] |>
head()
With base R:
set.seed(42)
mt[cols] <- lapply(mt[cols], \(z) ave(z, mt$cyl, FUN = \(z) ifelse(is.na(z), sample(na.omit(z), size=length(z), replace=TRUE), z)))
head(mt)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# Datsun 710 22.8 4 108 91 3.85 2.320 18.61 1 1 4 1
# Hornet 4 Drive 21.4 6 145 110 3.08 3.215 19.44 1 0 3 1
# Hornet Sportabout 18.7 8 301 180 3.15 3.440 17.02 0 0 3 2
# Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
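To address the worry about whether the fill-ins really respect the grouping, one option is to check every imputed cell against the observed values of its own group. The following is a sketch only: it rebuilds the same mtcars example and reuses the base R imputation, and the names orig and ok are introduced here for illustration.

```r
# Rebuild the example data with missing values, as in the answer.
mt <- mtcars
set.seed(42)
mt$disp[sample(32, 10)] <- NA
mt$hp[sample(32, 10)] <- NA
orig <- mt  # copy that still contains the NAs

# Same base R imputation as above.
cols <- c("disp", "hp")
mt[cols] <- lapply(mt[cols], \(z) ave(z, mt$cyl, FUN = \(z)
  ifelse(is.na(z), sample(na.omit(z), size = length(z), replace = TRUE), z)))

# For every imputed cell, test whether the new value occurs among the
# observed (non-NA) values of the same cyl group.
ok <- sapply(cols, function(col) {
  filled <- which(is.na(orig[[col]]))
  all(vapply(filled, function(i) {
    mt[[col]][i] %in% orig[[col]][orig$cyl == orig$cyl[i]]
  }, logical(1)))
})
ok
```

Both entries of ok should be TRUE as long as every group has at least two observed values to draw from.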
Note: while I set the random seed before imputing in each dialect, the order in which columns and groups are processed can differ between dialects, so the replacements for the NA values are not guaranteed to match; the seed is provided for basic reproducibility, not identical results.
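The question also asks whether this can be done with a loop. A rough base R sketch, under assumptions: the toy listings_df below is made up for illustration and only borrows the column names from the question, so the real data will look different.

```r
# Toy stand-in for the asker's data; the values are made up for illustration.
listings_df <- data.frame(
  room_type = c("Entire home", "Entire home", "Entire home",
                "Private room", "Private room", "Private room"),
  number_of_reviews    = c(10, NA, 25, 3, 4, NA),
  review_scores_rating = c(4.8, 4.9, NA, NA, 4.1, 4.3)
)

cols <- c("number_of_reviews", "review_scores_rating")

set.seed(42)
for (col in cols) {
  for (rt in unique(listings_df$room_type)) {
    in_group <- listings_df$room_type %in% rt
    missing  <- in_group & is.na(listings_df[[col]])
    observed <- listings_df[[col]][in_group & !is.na(listings_df[[col]])]
    # Skip groups with nothing to fill or nothing to draw from.
    if (any(missing) && length(observed) > 0) {
      # Index into `observed` so a lone value is not expanded to 1:value.
      picks <- sample.int(length(observed), size = sum(missing), replace = TRUE)
      listings_df[[col]][missing] <- observed[picks]
    }
  }
}
listings_df
```

The loop does the same job as the grouped versions above, just more verbosely; a group whose values are all NA is left untouched.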
Upvotes: 1