Reputation: 7999
Here is the data that I am working with:
library(RCurl)
x <- getURL("https://raw.githubusercontent.com/dothemathonthatone/maps/master/main_test.csv")
maindf <- read.csv(text = x)
maindf_1 <- maindf %>%
dplyr::select(year, regional_schlüssel, age_group, fee_per_inc, fert_total, daily_hours, low_fee, middle_fee, high_fee)
head(maindf_1)
year regional_schlüssel fee_per_inc fert_total daily_hours low_fee middle_fee high_fee
2006 12246436188 0.000000000 0.02905331 8 1 0 0
2006 12246436188 0.002770760 0.02905331 8 1 0 0
2006 12246436188 0.003857333 0.02905331 8 1 0 0
2006 12246436188 0.004237633 0.02905331 8 0 1 0
2006 12246436188 0.004482112 0.02905331 8 0 1 0
2006 12246436188 0.005085077 0.02905331 8 0 1 0
To prepare the data for panel regression, I want to average the non-zero values in fee_per_inc
, the 4th column, according to the last three columns; e.g.,
year regional_schlüssel age_group fee_per_inc fert_total daily_hours low_fee middle_fee high_fee
2006 12246436188 -8 0.000000000 0.02905331 8 .003314047 0 0
2006 12246436188 -8 0.002770760 0.02905331 8 .003314047 0 0
2006 12246436188 -8 0.003857333 0.02905331 8 .003314047 0 0
2006 12246436188 -8 0.004237633 0.02905331 8 0 .004601607 0
2006 12246436188 -8 0.004482112 0.02905331 8 0 .004601607 0
2006 12246436188 -8 0.005085077 0.02905331 8 0 .004601607 0
and then remove the zeros in the last three rows:
year regional_schlüssel age_group fee_per_inc fert_total daily_hours low_fee middle_fee high_fee
2006 12246436188 -8 0.000000000 0.02905331 8 .003314047 .004601607 0
2006 12246436188 -8 0.002770760 0.02905331 8 .003314047 .004601607 0
2006 12246436188 -8 0.003857333 0.02905331 8 .003314047 .004601607 0
except that in this example the last row still has zeros.
After doing this I can drop fee_per_inc
, redundant rows, and go into panel regression.
Upvotes: 0
Views: 86
Reputation: 388797
One way to do this is by reshaping the data to long-format, remove rows with 0 values, grouping by regional_schlüssel
and unique column names. We can take mean
of non-zero fee_per_inc
values, cast the data to wide format and remove unwanted columns.
library(dplyr)
library(tidyr)
maindf_1 %>%
pivot_longer(cols = ends_with('fee')) %>%
filter(value != 0) %>%
group_by(regional_schlüssel, grp = data.table::rleid(name)) %>%
mutate(value = mean(fee_per_inc[fee_per_inc != 0], na.rm = TRUE),
row = row_number()) %>%
pivot_wider(values_fill = list(value = 0)) %>%
ungroup() %>%
select(-grp, -fee_per_inc, -row)
# year regional_schlüssel age_group fert_total daily_hours low_fee middle_fee high_fee
# <int> <dbl> <fct> <dbl> <fct> <dbl> <dbl> <dbl>
# 1 2006 12246436188. -8 0.0291 8 0.00331 0 0
# 2 2006 12246436188. -8 0.0291 8 0.00331 0 0
# 3 2006 12246436188. -8 0.0291 8 0.00331 0 0
# 4 2006 12246436188. -8 0.0291 8 0 0.00460 0
# 5 2006 12246436188. -8 0.0291 8 0 0.00460 0
# 6 2006 12246436188. -8 0.0291 8 0 0.00460 0
# 7 2006 12246436188. -8 0.0291 8 0.00197 0 0
# 8 2006 12246436188. -8 0.0291 8 0.00197 0 0
# 9 2006 12246436188. -8 0.0291 8 0.00197 0 0
#10 2006 12246436188. -8 0.0291 8 0 0.00308 0
# … with 9,907 more rows
Upvotes: 1
Reputation: 21
just a newbie here, but maybe something like this?
maindf_2 <- maindf_1 %>%
mutate(fee_per_inc = ifelse(fee_per_inc==0,NA, fee_per_inc)) %>%
group_by(low_fee, middle_fee, high_fee) %>%
mutate(low_fee_avg = ifelse(low_fee !=0, mean(fee_per_inc, na.rm = T), NA),
mid_fee_avg = ifelse(middle_fee !=0, mean(fee_per_inc, na.rm = T), NA),
high_fee_avg = ifelse(high_fee !=0, mean(fee_per_inc, na.rm = T), NA)
) %>%
ungroup() %>%
select(-ends_with("_fee"))
Upvotes: 2