Reputation: 209
This problem is a spin-off from my last post (How to calculate moving average for two years in r).
I have a big data frame (900k rows) about mergers and acquisitions (M&As).
The df has four columns: date (when the M&A was completed), target_nation (a company of which country was merged/acquired), acquiror_nation (corporation of which country was the acquiror), and big_corp_TF (whether the acquiror was a big corporation or not, where TRUE means that corporation is big). Here is a sample of my data:
> df <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2002L,
2002L, 2002L, 2003L, 2003L, 2004L, 2004L, 2004L, 2006L, 2006L
), target_nation = c("Uganda", "Uganda", "Uganda", "Uganda",
"Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda",
"Uganda", "Uganda", "Uganda", "Uganda"), acquiror_nation = c("France",
"Germany", "France", "France", "Germany", "France", "France",
"Germany", "Germany", "Germany", "France", "France", "Germany",
"France", "France"), big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE,
TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)), row.names = c(NA,
-15L))
> df
date target_nation acquiror_nation big_corp_TF
1: 2000 Uganda France TRUE
2: 2000 Uganda Germany FALSE
3: 2001 Uganda France TRUE
4: 2001 Uganda France FALSE
5: 2001 Uganda Germany FALSE
6: 2002 Uganda France TRUE
7: 2002 Uganda France TRUE
8: 2002 Uganda Germany TRUE
9: 2003 Uganda Germany TRUE
10: 2003 Uganda Germany FALSE
11: 2004 Uganda France TRUE
12: 2004 Uganda France FALSE
13: 2004 Uganda Germany TRUE
14: 2006 Uganda France TRUE
15: 2006 Uganda France TRUE
NB: There are no rows for France in 2003; and there is no year 2005.
From these data, I want to create a new variable that denotes the share of M&As done by big corporations of specific acquiror nations, counting the average for 2 years. (For my actual exercise, I will count the averages for 5 years, but let's keep things simpler here). So there would be a new variable for France's big corporations, and a new variable for Germany's big corporations.
I was suggested to use the following code:
library(runner)
library(tidyverse)
df <- df %>% as.data.frame()
param <- 'France'
df %>%
group_by(date, target_nation) %>%
mutate(n1 = n()) %>%
group_by(date, target_nation, acquiror_nation) %>%
summarise(n1 = mean(n1),
n2 = sum(big_corp_TF), .groups = 'drop') %>%
filter(acquiror_nation == param) %>%
mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))
Which outputs this tibble:
date target_nation acquiror_nation n1 n2 share
<int> <chr> <chr> <dbl> <int> <dbl>
1 2000 Uganda France 2 1 0.5
2 2001 Uganda France 3 1 0.4
3 2002 Uganda France 3 2 0.5
4 2004 Uganda France 3 1 0.5
5 2006 Uganda France 2 2 0.6
NB: there is no result for France for 2003 and 2005; I would like there to be results for 2003 and 2005 (because we are calculating 2-year averages and thus we should be able to have results for 2003 and 2005). Also, the share for 2006 is incorrect in reality, because it should be 1 (it should take the values of 2005 (which are 0s) rather than the values of 2004 for the calculation of average).
I would like to be able to receive the following tibble:
date target_nation acquiror_nation n1 n2 share
<int> <chr> <chr> <dbl> <int> <dbl>
1 2000 Uganda France 2 1 0.5
2 2001 Uganda France 3 1 0.4
3 2002 Uganda France 3 2 0.5
4 2003 Uganda France 2 0 0.4
5 2004 Uganda France 3 1 0.2
6 2005 Uganda France 0 0 0.33
7 2006 Uganda France 2 2 1.0
NB: notice that the result for 2006 is also different (because we now take 2005 instead of 2004 for a two-year average).
I understand that this is a problem with the original data: it simply lacks certain data points. However, including them to the original data set seems to be highly inconvenient; it is probably better to include them mid-way, e.g. after counting the n1 and n2. But what is the most convenient way to do this?
Any suggestions are much appreciated.
Upvotes: 1
Views: 83
Reputation: 26218
use tidyr::complete
along with its arguments nesting
and fill
. Full code that may be used.
param <- 'France'
df %>%
mutate(d = 1) %>%
complete(date = seq(min(date), max(date), 1), nesting(target_nation, acquiror_nation),
fill = list(d =0, big_corp_TF = FALSE)) %>%
group_by(date, target_nation) %>%
mutate(n1 = sum(d)) %>%
group_by(date, target_nation, acquiror_nation) %>%
summarise(n1 = mean(n1),
n2 = sum(big_corp_TF), .groups = 'drop') %>%
filter(acquiror_nation == param) %>%
mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))
# A tibble: 7 x 6
date target_nation acquiror_nation n1 n2 share
<dbl> <chr> <chr> <dbl> <int> <dbl>
1 2000 Uganda France 2 1 0.5
2 2001 Uganda France 3 1 0.4
3 2002 Uganda France 3 2 0.5
4 2003 Uganda France 2 0 0.4
5 2004 Uganda France 3 1 0.2
6 2005 Uganda France 0 0 0.333
7 2006 Uganda France 2 2 1
Upvotes: 2
Reputation: 1234
df2 = df %>%
group_by(date, target_nation) %>%
mutate(n1 = n()) %>%
group_by(date, target_nation, acquiror_nation) %>%
summarise(n1 = mean(n1),
n2 = sum(big_corp_TF), .groups = 'drop') %>%
filter(acquiror_nation == param)
dates = seq(min(df2$date), max(df2$date), by = 1)
dates = setdiff(dates, df2$date)
df3 = df2[rep(nrow(df2), each = length(dates)), ]
df3$n1 = 0; df3$n2 = 0; df3$date = dates
df2 = arrange(rbind(df2,df3), date)
df2 = df2 %>% mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))
df2
# A tibble: 7 x 6
date target_nation acquiror_nation n1 n2 share
<dbl> <fct> <fct> <dbl> <dbl> <dbl>
1 2000 Uganda France 2 1 0.5
2 2001 Uganda France 3 1 0.4
3 2002 Uganda France 3 2 0.5
4 2003 Uganda France 0 0 0.667
5 2004 Uganda France 3 1 0.333
6 2005 Uganda France 0 0 0.333
7 2006 Uganda France 2 2 1
First, create df2
based on your df
but without calculating share
. Create a sequence of dates from the minimum to the maximum:
dates = seq(min(df2$date), max(df2$date), by = 1)
Leave just the ones that are missing in df2
:
dates = setdiff(dates, df2$date)
Create a row for each missing date and set n1
and n2
to 0:
df3 = df2[rep(nrow(df2), each = length(dates)), ]
df3$n1 = 0; df3$n2 = 0; df3$date = dates
Combine the rows and sort by date:
df2 = arrange(rbind(df2,df3), date)
Finally, calculate share
:
df2 = df2 %>% mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))
I apologise that this doesn't adhere to the tidyverse syntax
Upvotes: 0