Reputation: 1000
This is an approximation of the original dataframe. In the original, there are many more columns than are shown here.
id init_cont family description value
1 K S impacteach 1
1 K S impactover 3
1 K S read 2
2 I S impacteach 2
2 I S impactover 4
2 I S read 1
3 K D impacteach 3
3 K D impactover 5
3 K D read 3
I want to combine the values for impacteach and impactover to generate an average value that is just called impact. I would like the final table to look like the following:
id init_cont family description value
1 K S impact 2
1 K S read 2
2 I S impact 3
2 I S read 1
3 K D impact 4
3 K D read 3
I have not been able to figure out how to generate this table. However, I have been able to create a dataframe that looks like this:
id description value
1 impact 2
1 read 2
2 impact 3
2 read 1
3 impact 4
3 read 3
What is the best way for me to take these new values and add them to the original dataframe? I also need to remove the original values (like impacteach and impactover) in the original dataframe. I would prefer to modify the original dataframe as opposed to creating an entirely new dataframe because the original dataframe has many columns.
In case it is useful, this is a summary of the code I used to create the shorter dataframe with impact as a combination of impacteach and impactover:
df %<%
mutate(newdescription = case_when(description %in% c("impacteach", "impactoverall") ~ "impact", TRUE ~ description)) %<%
group_by(id, newdescription) %<%
summarise(value = mean(as.numeric(value)))
Upvotes: 1
Views: 215
Reputation: 20085
gsub
can be used to replace description
containing imact
as impact and then group_by
from dplyr
package will help in summarising the value.
df %>% group_by(id, init_cont, family,
description = gsub("^(impact).*","\\1", description)) %>%
summarise(value = mean(value))
# # A tibble: 6 x 5
# # Groups: id, init_cont, family [?]
# id init_cont family description value
# <int> <chr> <chr> <chr> <dbl>
# 1 1 K S impact 2.00
# 2 1 K S read 2.00
# 3 2 I S impact 3.00
# 4 2 I S read 1.00
# 5 3 K D impact 4.00
# 6 3 K D read 3.00
Upvotes: 0
Reputation: 4537
You just need to modify your group_by
statement. Try group_by(id, init_cont, family)
Because your id seems to be mapped to init_cont and family already, adding in these values won't change your summarization result. Then you have all the columns you want with no extra work.
If you have a lot of columns you could trying something like the code below. Essentially, do a left_join
onto your original data with your summarised data, but doing it using the .
to not store off a new dataframe. Then, once joined (by id and description which we modified in place) you'll have two value columns which should be prepeneded with a .x and .y, drop the original and then use distinct to get rid of the duplicate 'impact' columns.
df %>%
mutate(description = case_when(description %in% c("impacteach", "impactoverall") ~ "impact", TRUE ~ description)) %>%
left_join(. %>%
group_by(id, description)
summarise(value = mean(as.numeric(value))
,by=c('id','description')) %>%
select(-value.x) %>%
distinct()
Upvotes: 1
Reputation: 5191
What if you changed the description
column first so that it could be included in the grouping:
df %>%
mutate(description = substr(description, 1, 6)) %>%
group_by(id, init_cont, family, description) %>%
summarise(value = mean(value))
# A tibble: 6 x 5
# Groups: id, init_cont, family [?]
# id init_cont family description value
# <int> <chr> <chr> <chr> <dbl>
# 1 1 K S impact 2.
# 2 1 K S read 2.
# 3 2 I S impact 3.
# 4 2 I S read 1.
# 5 3 K D impact 4.
# 6 3 K D read 3.
Upvotes: 4