Reputation: 354
I want to re-work the data that I have in a dataframe based on adding certain values up, and to do that to all of the (numerical) columns in the same way.
In code, I have created a dataframe that is structured a bit like this:
library(tibble)
df_in <- tribble(~names, ~a, ~a_pc, ~b, ~b_pc,
"Three star", 1L, 1, 2L, 1,
"Two star", 5L, 5, 12L, 6,
"One star", 6L, 6, 100L, 50,
"No star", 88L, 88, 86L, 43,
"Empty", 0L, 0, 0L, 0,
"Also empty", 0L, 0, 0L, 0)
In my output I want to have one row that contains the sums of three rows in the input dataframe, another row that contains the sum of two of them, and one that contains the contents of a row from the original (but renamed).
I also want to keep other rows if they have numbers but to drop them if they are empty. I would prefer to do that programmatically, but can do it manually with indexing if need be, so that's a bit less important.
My desired output would be a bit like this:
df_out <- tribble(~names, ~a, ~a_pc, ~b, ~b_pc,
"Any stars", 12L, 12, 114L, 57,
"... of which at least 2 stars", 6L, 6, 14L, 7,
"... of which 3 stars", 1L, 1, 2L, 1,
"No star", 88L, 88, 86L, 43)
For example, that 12L
in the top left (meaning column a, "Any stars") is the sum of the 1L
, 5L
and 6L
entries in the a column of the input.
I want to do this merging of rows at this stage in my processing because it's important to do it after I've already calculated the percentage columns (..._pc
in the example). You'll see that in the output the percentage columns add to more than 100, which is correct because there is deliberately some 'double counting' - things can correctly show up in multiple rows if they meet the conditions.
Edit to add: Note that the labels I am using in the $names
column of the test dataset df_in
are not the real labels I have in my real situation. I imagine that a workable solution to this will be able to somehow take a collection of vectors that specify sets of rows and another collection of the same number of strings to label the new rows, and process through them. I might be able to define the sets of rows and the associated names like this:
set_1 <- c("Three star", "Two star", "One star")
set_2 <- c("Three star", "Two star")
set_3 <- "Three star"
set_4 <- "No star"
new_name_1 <- "Any stars"
new_name_2 <- "... of which at least 2 stars"
new_name_3 <- "... of which 3 stars"
new_name_4 <- "No star"
Upvotes: 0
Views: 235
Reputation: 887118
We may use imap
to loop over patterns (as some cases are overlapping) and do the group by sum
across
those columns (after filter
ing the rows)
library(purrr)
library(stringr)
imap_dfr(setNames(c('(?<!No) star', 'Two|Three', 'Three', 'Empty|No'),
c("Any stars", "... of which at least 2 stars",
"... of which 3 stars", "No star" )), ~ df_in %>%
filter(str_detect(names, regex(.x, ignore_case = TRUE))) %>%
group_by(names = .y) %>%
summarise(across(everything(), sum)))
-ouptut
# A tibble: 4 x 5
names a a_pc b b_pc
<chr> <int> <dbl> <int> <dbl>
1 Any stars 12 12 114 57
2 ... of which at least 2 stars 6 6 14 7
3 ... of which 3 stars 1 1 2 1
4 No star 88 88 86 43
OP's expected
> df_out
# A tibble: 4 x 5
names a a_pc b b_pc
<chr> <int> <dbl> <int> <dbl>
1 Any stars 12 12 114 57
2 ... of which at least 2 stars 6 6 14 7
3 ... of which 3 stars 1 1 2 1
4 No star 88 88 86 43
If the OP is passing a custom set of names
map2_dfr(mget(ls(pattern = '^set_\\d+')),
mget(ls(pattern = '^new_name_\\d+')),
~ df_in %>%
filter(names %in% .x) %>%
group_by(names = .y) %>%
summarise(across(everything(), sum)))
# A tibble: 4 x 5
names a a_pc b b_pc
<chr> <int> <dbl> <int> <dbl>
1 Any stars 12 12 114 57
2 ... of which at least 2 stars 6 6 14 7
3 ... of which 3 stars 1 1 2 1
4 No star 88 88 86 43
Upvotes: 1