Collective Action
Collective Action

Reputation: 7999

average column rows according to value in other columns

Here is the data that I am working with:

library(RCurl)
x <- getURL("https://raw.githubusercontent.com/dothemathonthatone/maps/master/main_test.csv")
maindf <- read.csv(text = x)

maindf_1 <- maindf %>% 
             dplyr::select(year, regional_schlüssel, age_group, fee_per_inc, fert_total, daily_hours, low_fee, middle_fee, high_fee) 

head(maindf_1)

year    regional_schlüssel  fee_per_inc fert_total  daily_hours low_fee middle_fee  high_fee
2006    12246436188 0.000000000 0.02905331  8   1   0   0
2006    12246436188 0.002770760 0.02905331  8   1   0   0
2006    12246436188 0.003857333 0.02905331  8   1   0   0
2006    12246436188 0.004237633 0.02905331  8   0   1   0
2006    12246436188 0.004482112 0.02905331  8   0   1   0
2006    12246436188 0.005085077 0.02905331  8   0   1   0 

To prepare the data for panel regression, I want to average the non-zero values in fee_per_inc, the 4th column, according to the last three columns; e.g.,

year    regional_schlüssel  age_group   fee_per_inc fert_total  daily_hours low_fee middle_fee  high_fee
2006    12246436188 -8  0.000000000 0.02905331  8   .003314047  0   0
2006    12246436188 -8  0.002770760 0.02905331  8   .003314047  0   0
2006    12246436188 -8  0.003857333 0.02905331  8   .003314047  0   0
2006    12246436188 -8  0.004237633 0.02905331  8   0   .004601607  0
2006    12246436188 -8  0.004482112 0.02905331  8   0   .004601607  0
2006    12246436188 -8  0.005085077 0.02905331  8   0   .004601607  0  

and then remove the zeros in the last three rows:

year    regional_schlüssel  age_group   fee_per_inc fert_total  daily_hours low_fee middle_fee  high_fee
2006    12246436188 -8  0.000000000 0.02905331  8   .003314047  .004601607  0
2006    12246436188 -8  0.002770760 0.02905331  8   .003314047  .004601607  0
2006    12246436188 -8  0.003857333 0.02905331  8   .003314047  .004601607  0

except that in this example the last row still has zeros. After doing this I can drop fee_per_inc, redundant rows, and go into panel regression.

Upvotes: 0

Views: 86

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 388797

One way to do this is by reshaping the data to long-format, remove rows with 0 values, grouping by regional_schlüssel and unique column names. We can take mean of non-zero fee_per_inc values, cast the data to wide format and remove unwanted columns.

library(dplyr)
library(tidyr)


maindf_1 %>%
  pivot_longer(cols = ends_with('fee')) %>%
  filter(value != 0) %>%
  group_by(regional_schlüssel, grp = data.table::rleid(name)) %>%
  mutate(value =  mean(fee_per_inc[fee_per_inc != 0],  na.rm = TRUE), 
         row = row_number()) %>%
  pivot_wider(values_fill = list(value  =  0)) %>%
  ungroup() %>%
  select(-grp, -fee_per_inc, -row)


#    year regional_schlüssel age_group fert_total daily_hours low_fee middle_fee high_fee
#   <int>              <dbl> <fct>          <dbl> <fct>         <dbl>      <dbl>    <dbl>
# 1  2006       12246436188. -8            0.0291 8           0.00331    0              0
# 2  2006       12246436188. -8            0.0291 8           0.00331    0              0
# 3  2006       12246436188. -8            0.0291 8           0.00331    0              0
# 4  2006       12246436188. -8            0.0291 8           0          0.00460        0
# 5  2006       12246436188. -8            0.0291 8           0          0.00460        0
# 6  2006       12246436188. -8            0.0291 8           0          0.00460        0
# 7  2006       12246436188. -8            0.0291 8           0.00197    0              0
# 8  2006       12246436188. -8            0.0291 8           0.00197    0              0
# 9  2006       12246436188. -8            0.0291 8           0.00197    0              0
#10  2006       12246436188. -8            0.0291 8           0          0.00308        0
 # … with 9,907 more rows

Upvotes: 1

Enrico Galli
Enrico Galli

Reputation: 21

just a newbie here, but maybe something like this?

maindf_2 <- maindf_1 %>% 
  mutate(fee_per_inc = ifelse(fee_per_inc==0,NA, fee_per_inc)) %>% 
  group_by(low_fee, middle_fee, high_fee) %>% 
  mutate(low_fee_avg = ifelse(low_fee !=0, mean(fee_per_inc, na.rm = T), NA),
         mid_fee_avg = ifelse(middle_fee !=0, mean(fee_per_inc, na.rm = T), NA),
         high_fee_avg = ifelse(high_fee !=0, mean(fee_per_inc, na.rm = T), NA)
  ) %>% 
  ungroup() %>% 
  select(-ends_with("_fee"))

Upvotes: 2

Related Questions