jamse
jamse

Reputation: 354

Calculate a new dataframe via applying calculations to rows in R

I want to re-work the data that I have in a dataframe based on adding certain values up, and to do that to all of the (numerical) columns in the same way.

In code, I have created a dataframe that is structured a bit like this:

library(tibble)

df_in <- tribble(~names,        ~a,  ~a_pc,   ~b,   ~b_pc,
                 "Three star",   1L,     1,    2L,      1,
                 "Two star",     5L,     5,   12L,      6,
                 "One star",     6L,     6,  100L,     50,
                 "No star",     88L,    88,   86L,     43,
                 "Empty",        0L,     0,    0L,      0,
                 "Also empty",   0L,     0,    0L,      0)

In my output I want to have one row that contains the sums of three rows in the input dataframe, another row that contains the sum of two of them, and one that contains the contents of a row from the original (but renamed).

I also want to keep other rows if they have numbers but to drop them if they are empty. I would prefer to do that programmatically, but can do it manually with indexing if need be, so that's a bit less important.

My desired output would be a bit like this:


df_out <- tribble(~names,                          ~a,  ~a_pc,   ~b,   ~b_pc,
                  "Any stars",                    12L,    12,  114L,     57,
                  "... of which at least 2 stars", 6L,     6,   14L,      7,
                  "... of which 3 stars",          1L,     1,    2L,      1,
                  "No star",                      88L,    88,   86L,     43)

For example, that 12L in the top left (meaning column a, "Any stars") is the sum of the 1L, 5L and 6L entries in the a column of the input.

I want to do this merging of rows at this stage in my processing because it's important to do it after I've already calculated the percentage columns (..._pc in the example). You'll see that in the output the percentage columns add to more than 100, which is correct because there is deliberately some 'double counting' - things can correctly show up in multiple rows if they meet the conditions.


Edit to add: Note that the labels I am using in the $names column of the test dataset df_in are not the real labels I have in my real situation. I imagine that a workable solution to this will be able to somehow take a collection of vectors that specify sets of rows and another collection of the same number of strings to label the new rows, and process through them. I might be able to define the sets of rows and the associated names like this:

set_1 <- c("Three star", "Two star", "One star")
set_2 <- c("Three star", "Two star")
set_3 <- "Three star"
set_4 <- "No star"

new_name_1 <- "Any stars"
new_name_2 <- "... of which at least 2 stars"
new_name_3 <- "... of which 3 stars"
new_name_4 <- "No star"

Upvotes: 0

Views: 235

Answers (1)

akrun
akrun

Reputation: 887118

We may use imap to loop over patterns (as some cases are overlapping) and do the group by sum across those columns (after filtering the rows)

library(purrr)
library(stringr)

imap_dfr(setNames(c('(?<!No) star', 'Two|Three', 'Three', 'Empty|No'), 
   c("Any stars", "... of which at least 2 stars", 
      "... of which 3 stars", "No star" )), ~ df_in %>% 
        filter(str_detect(names, regex(.x, ignore_case = TRUE))) %>% 
        group_by(names = .y) %>% 
        summarise(across(everything(), sum)))

-ouptut

# A tibble: 4 x 5
  names                             a  a_pc     b  b_pc
  <chr>                         <int> <dbl> <int> <dbl>
1 Any stars                        12    12   114    57
2 ... of which at least 2 stars     6     6    14     7
3 ... of which 3 stars              1     1     2     1
4 No star                          88    88    86    43

OP's expected

> df_out
# A tibble: 4 x 5
  names                             a  a_pc     b  b_pc
  <chr>                         <int> <dbl> <int> <dbl>
1 Any stars                        12    12   114    57
2 ... of which at least 2 stars     6     6    14     7
3 ... of which 3 stars              1     1     2     1
4 No star                          88    88    86    43

Update

If the OP is passing a custom set of names

map2_dfr(mget(ls(pattern = '^set_\\d+')),
         mget(ls(pattern = '^new_name_\\d+')), 
         ~ df_in %>%
                   filter(names %in% .x)  %>%
                   group_by(names = .y) %>%
                   summarise(across(everything(), sum)))
                   
# A tibble: 4 x 5
  names                             a  a_pc     b  b_pc
  <chr>                         <int> <dbl> <int> <dbl>
1 Any stars                        12    12   114    57
2 ... of which at least 2 stars     6     6    14     7
3 ... of which 3 stars              1     1     2     1
4 No star                          88    88    86    43

Upvotes: 1

Related Questions