Reputation: 161
Say I have one data frame of tooth brush brands and a measure of how popular they are over time:
year brand_1 brand_2
2010 0.7 0.3
2011 0.6 0.6
2012 0.4 0.9
And another that says when each tooth brush brand went electrical, with NA
meaning they never did so:
brand went_electrical_year
brand_1 NA
brand_2 2011
Now I'd like to combine these to get the prevalence of electrical tooth brush brands (as a proportion of the total) each year:
year electrical_prevalence
2010 0
2011 0.5
2012 0.69
In 2010 it's 0 b/c none of the brands are electrical. In 2011 it's 0.5 b/c both are and they are equally prevalent. In 2012 it's 0.69 b/c both are but the electrical one is more prevalent.
I've wrestled with this in R but can't figure out a way to do it. Would appreciate any help or suggestions. Cheers.
Upvotes: 1
Views: 39
Reputation: 30474
Assuming your data frames are df1
and df2
, you can use the following tidyverse
approach.
First, use pivot_longer
to put your data into a long format which will be easier to manipulate. Use left_join
to add the relevant years of when the brands went electrical.
We can create an indicator mult
which will be 1 if the brand has gone electrical, or zero if it hadn't.
Then, for each year, you can determine the proportion by multiplying the popularity value by mult
for each brand, and then dividing by the total sum for that year.
library(tidyverse)
df1 %>%
pivot_longer(cols = -year) %>%
left_join(df2, by = c("name" = "brand")) %>%
mutate(mult = ifelse(went_electrical_year > year | is.na(went_electrical_year), 0, 1)) %>%
group_by(year) %>%
summarise(electrical_prevalence = sum(value * mult) / sum(value))
Output
year electrical_prevalence
<int> <dbl>
1 2010 0
2 2011 0.5
3 2012 0.692
Upvotes: 2