socialscientist

Reputation: 4232

R: Estimating weighted quantile by group with assignment

I am trying to calculate the percentile (0 to 100) of each observation for a continuous variable (let's call it 'value') within its group, using sampling weights, and to store that percentile for each observation in a new variable.

In other words, each row is an observation and each observation belongs to a single group. All of the groups have more than 2 observations. Within each group, I need to estimate the distribution of value using the sampling weights in my data, determine at which percentile an observation falls within its group's distribution, then add that percentile as a column to the data frame.

As far as I can tell, the survey package has svyby() and svyquantile() but the latter returns values for the specified quantiles rather than the quantile of a value for a given observation.

# Load survey package
library(survey)

# Set seed for replication
set.seed(123)

# Create data with value, group, weight
dat <- data.frame(value = 1:6, 
                  group = rep(1:3, 2), 
                  weight = abs(rnorm(6)))

# Declare survey design (weights are passed as a formula)
d <- survey::svydesign(id = ~1, data = dat, weights = ~weight)

# Do something to calculate the quantile and add it to the data
????

This is similar to the question "Compute quantiles incorporating Sample Design (Survey package)", but that question does not compute the quantiles by subgroup.

Upvotes: 0

Views: 2218

Answers (1)

socialscientist

Reputation: 4232

I put together a solution. The sequence of statements in mutate() below can be modified to convert the cumulative sampling weights into whatever quantiles are of interest. While this could be done in base R, I use the dplyr package because dplyr::bind_rows() fills in NAs when combining two data frames whose columns differ.
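To illustrate the bind_rows() behavior relied on here, a minimal sketch with two made-up data frames (a and b are hypothetical and not part of the solution): a column that exists in only one input is filled with NA for the rows coming from the other.

```r
library(dplyr)

# Hypothetical inputs: 'a' has a decile column, 'b' does not
a <- data.frame(value = 1:2, decile = c(3, 10))
b <- data.frame(value = NA_integer_)

# Rows from 'b' get NA in the decile column
combined <- dplyr::bind_rows(a, b)
print(combined)
```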

# Load dplyr for the pipe operator (%>%)
library(dplyr)

# Set seed for replication
set.seed(123)

# Create data with value, group, weight
dat <- data.frame(value = 1:6, 
                  group = rep(1:3, 2), 
                  weight = abs(rnorm(6)))

# Initialize list for storing group results
# Setting the length of the list is quicker than
# creating an empty list and growing it
quantile_list <- vector("list", length(unique(dat$group)))

# Initialize variable to indicate initial iteration
iteration <- 0

# estimate the decile of each respondent
# in a large for-loop

for(g in unique(dat$group)) {

  # Keep only observations for the current group
  # (g is the loop variable, group is the column in dat)
  temp <- dat %>% dplyr::filter(group == g)

  # Create subset with missing values
  temp_missing <- temp %>% dplyr::filter(is.na(value))

  # Create subset without missing values
  temp_nonmissing <- temp %>% dplyr::filter(!is.na(value))

  # Sort observations with value on value, calculate cumulative
  # sum of sampling weights, create variable indicating the decile
  # of responses. 1 = lowest, 10 = highest
  temp_nonmissing <- temp_nonmissing %>% 
    dplyr::arrange(value) %>%
    dplyr::mutate(
      cumulative_weight = cumsum(weight),
      cumulative_weight_prop = cumulative_weight / sum(weight),
      decile = dplyr::case_when(
        cumulative_weight_prop < 0.10 ~ 1,
        cumulative_weight_prop >= 0.10 & cumulative_weight_prop < 0.20 ~ 2,
        cumulative_weight_prop >= 0.20 & cumulative_weight_prop < 0.30 ~ 3,
        cumulative_weight_prop >= 0.30 & cumulative_weight_prop < 0.40 ~ 4,
        cumulative_weight_prop >= 0.40 & cumulative_weight_prop < 0.50 ~ 5,
        cumulative_weight_prop >= 0.50 & cumulative_weight_prop < 0.60 ~ 6,
        cumulative_weight_prop >= 0.60 & cumulative_weight_prop < 0.70 ~ 7,
        cumulative_weight_prop >= 0.70 & cumulative_weight_prop < 0.80 ~ 8,
        cumulative_weight_prop >= 0.80 & cumulative_weight_prop < 0.90 ~ 9,
        cumulative_weight_prop >= 0.90 ~ 10
      )
    )

  # Increment the iteration of the for loop
  iteration <- iteration + 1

  # Join the data with missing values and the data without
  # missing values on the value variable into
  # a single data frame
  quantile_list[[iteration]] <- dplyr::bind_rows(temp_nonmissing, temp_missing)
  }

# Convert the list of data frames into a single dataframe
out <- dplyr::bind_rows(quantile_list)

# Show outcome
head(out)
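The same per-group calculation can probably also be written without an explicit loop. The following is only a sketch, assuming weights are never missing and that tied values need no special treatment; the names out_grouped, w and percentile are mine, not part of the solution above. It stores the cumulative weight share on a 0 to 100 scale instead of deciles and leaves rows with a missing value as NA.

```r
library(dplyr)

set.seed(123)

# Same example data as above
dat <- data.frame(value = 1:6,
                  group = rep(1:3, 2),
                  weight = abs(rnorm(6)))

out_grouped <- dat %>%
  dplyr::group_by(group) %>%
  # Sort by value within each group (missing values sort last)
  dplyr::arrange(value, .by_group = TRUE) %>%
  dplyr::mutate(
    # Rows with a missing value contribute no weight
    w = dplyr::if_else(is.na(value), 0, weight),
    # Share of the group's weight at or below each value
    cumulative_weight_prop = dplyr::if_else(is.na(value),
                                            NA_real_,
                                            cumsum(w) / sum(w)),
    # Same share on a 0 to 100 scale
    percentile = 100 * cumulative_weight_prop
  ) %>%
  dplyr::ungroup()

print(out_grouped)
```

As in the loop version, the largest value in each group ends up at the top of the scale (100 here) because the cumulative sum includes the observation's own weight.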

Upvotes: 0
