Reputation: 984
This is a follow-up to a question I posted previously, because apparently I asked the wrong question.
I have two dataframes with relative frequencies of a certain combination of features. The relative frequencies in each one add up to 1. I'd like to join the two dataframes, which share one feature, to obtain a new dataframe whose relative frequencies add up to 1 as well.
Here is a MWE:
I have two tibbles like so:
library(dplyr)
my_tib1 <- tibble(feature1 = c("A", "A", "B", "B", "C", "C"), feature2 = c("AA", "BB", "AA", "BB", "AA", "BB"), number = c(0.1, 0.1, 0.3, 0.4, 0.05, 0.05))
my_tib2 <- tibble(feature3 = c("TT", "TT", "FF", "FF"), feature2 = c("AA", "BB", "AA", "BB"), number = c(0.1, 0.4, 0.3, 0.2))
which look like this:
# A tibble: 6 × 3
feature1 feature2 number
<chr> <chr> <dbl>
1 A AA 0.1
2 A BB 0.1
3 B AA 0.3
4 B BB 0.4
5 C AA 0.05
6 C BB 0.05
# A tibble: 4 × 3
feature3 feature2 number
<chr> <chr> <dbl>
1 TT AA 0.1
2 TT BB 0.4
3 FF AA 0.3
4 FF BB 0.2
Note that feature2 has the same categories in both tibbles. The number is unique for each combination of feature1 and feature2 in my_tib1, and of feature2 and feature3 in my_tib2.
For context: The number column represents marginal probabilities and I'd like to multiply the marginal distributions to get joint distributions (I'm aware of the assumptions).
What I think this requires is to get all possible combinations of feature1, feature2, and feature3 and multiply their number values in a new tibble column. The resulting tibble should have 12 rows: 3 levels of feature1 x 2 levels of feature2 x 2 levels of feature3.
The final tibble should look something like this:
# A tibble: 12 × 6
feature1 feature2 feature3 number.x number.y number.mult
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 A AA TT 0.1 0.1 0.01
2 A AA FF 0.1 0.3 0.03
...
with the numbers in number.mult adding up to 1.
I have tried the following and I think I'm close but it doesn't quite work:
my_tib1 %>% full_join(my_tib2, by = "feature2") %>% mutate(number.mult = number.x*number.y)
This gives me the 12x6 tibble I'm looking for, but the numbers in number.mult do not add up to 1.
Upvotes: 0
Views: 452
Reputation: 1683
I think there is some confusion here. You are trying to calculate the joint distribution of 3 independent variables. But if you calculate the marginal distribution of feature2 in both tibbles, you will see that they are not the same, so the variables are probably not independent, or there is some bias in the data. In any case, a joint distribution depends on the marginal frequencies of its variables, so you usually cannot mix two of them. What you are doing is multiplying two joint distributions, each over a different pair of variables.
What you have to do instead is multiply the joint distribution of feature1 and feature2 by the marginal distribution of feature3.
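The mismatch described above can be checked directly. The following is a minimal sketch using the data from the question; the names marg2_tib1, marg2_tib2 and naive_total are illustrative and not part of the original code. It shows that the two tables imply different marginals for feature2, and that the product after the join on feature2 therefore sums to the dot product of those two marginals rather than to 1.

```r
library(dplyr)

# Data copied from the question
my_tib1 <- tibble(
  feature1 = c("A", "A", "B", "B", "C", "C"),
  feature2 = c("AA", "BB", "AA", "BB", "AA", "BB"),
  number   = c(0.1, 0.1, 0.3, 0.4, 0.05, 0.05)
)
my_tib2 <- tibble(
  feature3 = c("TT", "TT", "FF", "FF"),
  feature2 = c("AA", "BB", "AA", "BB"),
  number   = c(0.1, 0.4, 0.3, 0.2)
)

# Marginal distribution of feature2 implied by each table
marg2_tib1 <- my_tib1 %>% group_by(feature2) %>% summarise(p = sum(number))
marg2_tib2 <- my_tib2 %>% group_by(feature2) %>% summarise(p = sum(number))
marg2_tib1  # AA = 0.45, BB = 0.55
marg2_tib2  # AA = 0.40, BB = 0.60

# Total of the row-wise product from the question's join on feature2.
# Note: dplyr >= 1.1.0 warns about a many-to-many join here; that is expected.
naive_total <- my_tib1 %>%
  full_join(my_tib2, by = "feature2") %>%
  mutate(number.mult = number.x * number.y) %>%
  summarise(total = sum(number.mult)) %>%
  pull(total)
naive_total  # 0.45 * 0.40 + 0.55 * 0.60 = 0.51
```

The total would only reach 1 if one of the two factors were summed out first, which is what the rest of this answer does for feature3.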
my_tib1 is your first joint distribution:
# A tibble: 6 x 3
feature1 feature2 number
<chr> <chr> <dbl>
1 A AA 0.1
2 A BB 0.1
3 B AA 0.3
4 B BB 0.4
5 C AA 0.05
6 C BB 0.05
Or as a table:
library(tidyverse)
my_tib1 %>% pivot_wider(names_from = feature2, values_from=number)
# A tibble: 3 x 3
feature1 AA BB
<chr> <dbl> <dbl>
1 A 0.1 0.1
2 B 0.3 0.4
3 C 0.05 0.05
Your second table of relative frequencies or joint distribution is:
my_tib2 %>% pivot_wider(names_from = feature2, values_from=number)
# A tibble: 2 x 3
feature3 AA BB
<chr> <dbl> <dbl>
1 TT 0.1 0.4
2 FF 0.3 0.2
You can calculate the marginal distribution of feature3. As you can see, it sums to 1.
marginals3 = my_tib2 %>%
pivot_wider(names_from = feature2, values_from=number) %>%
rowwise() %>%
mutate(marginals3 = AA+BB)
> marginals3
# A tibble: 2 x 4
# Rowwise:
feature3 AA BB marginals3
<chr> <dbl> <dbl> <dbl>
1 TT 0.1 0.4 0.5
2 FF 0.3 0.2 0.5
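You don't need to pivot to calculate it; just group by feature3 and keep feature2 in the result:
marginals3 = my_tib2 %>%
group_by(feature3) %>%
# mutate() keeps one row per feature2 level; summarise() with several rows per group is deprecated since dplyr 1.1.0
mutate(marginals3 = sum(number)) %>%
ungroup() %>%
select(feature3, feature2, marginals3)
Because feature2 is kept, you can join the result with my_tib1 to calculate the resulting joint distribution, i.e. the frequencies in my_tib1 multiplied by the marginals of feature3: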
my_tib1 %>%
left_join(marginals3, by='feature2') %>%
mutate(number.mult = number*marginals3)
If you then summarise(sum(number.mult)), you will see the result is 1.
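For reference, here is a self-contained sketch of the whole calculation. It is a variation on the code above, not the same code: instead of carrying feature2 along and joining on it, it pairs every row of my_tib1 with every level of feature3 using cross_join(), which assumes dplyr >= 1.1.0. The names marg3 and joint are illustrative.

```r
library(dplyr)

# Data copied from the question
my_tib1 <- tibble(
  feature1 = c("A", "A", "B", "B", "C", "C"),
  feature2 = c("AA", "BB", "AA", "BB", "AA", "BB"),
  number   = c(0.1, 0.1, 0.3, 0.4, 0.05, 0.05)
)
my_tib2 <- tibble(
  feature3 = c("TT", "TT", "FF", "FF"),
  feature2 = c("AA", "BB", "AA", "BB"),
  number   = c(0.1, 0.4, 0.3, 0.2)
)

# Marginal distribution of feature3: one row per level (FF = 0.5, TT = 0.5)
marg3 <- my_tib2 %>%
  group_by(feature3) %>%
  summarise(marginals3 = sum(number))

# Assumption: feature3 is independent of (feature1, feature2), so the joint
# probability is P(feature1, feature2) * P(feature3).
# cross_join() (dplyr >= 1.1.0) builds all 6 x 2 = 12 row combinations.
joint <- my_tib1 %>%
  cross_join(marg3) %>%
  mutate(number.mult = number * marginals3)

nrow(joint)             # 12
sum(joint$number.mult)  # 1, up to floating-point error
```

Since both factors sum to 1 on their own, the 12 products necessarily sum to 1 as well. This holds only under the independence assumption stated in the comment.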
Upvotes: 1