Tea Tree

Reputation: 984

How to compute joint distribution from marginal distributions given independence?

This is a follow-up to a question I posted previously, because apparently I asked the wrong question.

I have two dataframes with relative frequencies of certain combinations of features. The relative frequencies in each one add up to 1. I'd like to join the two dataframes, which share one feature, to obtain a new dataframe whose relative frequencies add up to 1 as well.

Here is a MWE:

I have two tibbles like so:

library(dplyr)
my_tib1 <- tibble(feature1 = c("A", "A", "B", "B", "C", "C"), feature2 = c("AA", "BB", "AA", "BB", "AA", "BB"), number = c(0.1, 0.1, 0.3, 0.4, 0.05, 0.05))
my_tib2 <- tibble(feature3 = c("TT", "TT", "FF", "FF"), feature2 = c("AA", "BB", "AA", "BB"), number = c(0.1, 0.4, 0.3, 0.2))

which looks like this:

# A tibble: 6 × 3
  feature1 feature2 number
  <chr>    <chr>     <dbl>
1 A        AA          0.1
2 A        BB          0.1
3 B        AA          0.3
4 B        BB          0.4
5 C        AA          0.05
6 C        BB          0.05

# A tibble: 4 × 3
  feature3 feature2 number
  <chr>    <chr>     <dbl>
1 TT       AA          0.1
2 TT       BB          0.4
3 FF       AA          0.3
4 FF       BB          0.2

Note that feature2 has the same categories in both tibbles. The number is unique for each combination of feature1 and feature2 in my_tib1 and feature2 and feature3 in my_tib2.

For context: The number column represents marginal probabilities and I'd like to multiply the marginal distributions to get joint distributions (I'm aware of the assumptions).

What I think this requires is to get all possible combinations of feature1, feature2, and feature3 and multiply their numbers in a new tibble column. The resulting tibble should have 12 rows: 3 x feature1, 2 x feature2, 2 x feature3.

The final tibble should look something like this:

# A tibble: 12 × 6
  feature1 feature2 feature3  number.x  number.y  number.mult
  <chr>    <chr>    <chr>     <dbl>     <dbl>     <dbl>
1 A        AA       TT        0.1       0.1       0.01
2 A        AA       FF        0.1       0.3       0.03
...

with the numbers in number.mult adding up to 1.

I have tried the following and I think I'm close but it doesn't quite work:

my_tib1 %>% full_join(my_tib2, by = "feature2") %>% mutate(number.mult = number.x*number.y)

This gives me the 12 x 6 tibble I'm looking for, but the numbers in number.mult do not add up to 1.

Upvotes: 0

Views: 452

Answers (1)

RobertoT

Reputation: 1683

I think there may be some confusion. You are trying to calculate the joint distribution of 3 variables that you assume are independent. But if you calculate the marginal distribution of feature2 from each tibble, you will see that the two do not match, so the variables are likely not independent, or there is some bias in the data. In any case, a joint distribution depends on the marginal frequencies of its variables, so you usually cannot mix two of them. What your full_join does is multiply 2 joint distributions, each over a different pair of variables, and that product is not a distribution.
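To make the mismatch concrete, here is a minimal sketch that computes the marginal of feature2 implied by each tibble. The names f2_from_tib1, f2_from_tib2 and naive_total are only illustrative. The last line also shows why the full_join product cannot sum to 1: its total is the sum over feature2 of the product of the two feature2 marginals.

```r
library(dplyr)

my_tib1 <- tibble(
  feature1 = c("A", "A", "B", "B", "C", "C"),
  feature2 = c("AA", "BB", "AA", "BB", "AA", "BB"),
  number   = c(0.1, 0.1, 0.3, 0.4, 0.05, 0.05)
)
my_tib2 <- tibble(
  feature3 = c("TT", "TT", "FF", "FF"),
  feature2 = c("AA", "BB", "AA", "BB"),
  number   = c(0.1, 0.4, 0.3, 0.2)
)

# Marginal of feature2 implied by each table: sum over the other feature
f2_from_tib1 <- my_tib1 %>% group_by(feature2) %>% summarise(p = sum(number))
f2_from_tib2 <- my_tib2 %>% group_by(feature2) %>% summarise(p = sum(number))

f2_from_tib1  # AA = 0.45, BB = 0.55
f2_from_tib2  # AA = 0.40, BB = 0.60

# Total of number.x * number.y after joining on feature2:
# 0.45 * 0.40 + 0.55 * 0.60 = 0.51, not 1
naive_total <- sum(f2_from_tib1$p * f2_from_tib2$p)
naive_total
```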

What you have to do is multiply the joint distribution of feature1 and feature2 by the marginal distribution of feature3.

my_tib1 is your first joint distribution:
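Written out, and assuming feature3 is independent of the pair (feature1, feature2), the product is a proper distribution because each factor sums to 1 on its own. The symbols f_1, f_2, f_3 are shorthand introduced here for the values of the three features:

```latex
P(f_1, f_2, f_3) = P(f_1, f_2)\, P(f_3),
\qquad
\sum_{f_1, f_2, f_3} P(f_1, f_2)\, P(f_3)
  = \Big( \sum_{f_1, f_2} P(f_1, f_2) \Big) \Big( \sum_{f_3} P(f_3) \Big)
  = 1 \cdot 1 = 1 .
```

For example, with the data in the question, P(A, AA, TT) = 0.1 x (0.1 + 0.4) = 0.05.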

# A tibble: 6 x 3
  feature1 feature2 number
  <chr>    <chr>     <dbl>
1 A        AA         0.1 
2 A        BB         0.1 
3 B        AA         0.3 
4 B        BB         0.4 
5 C        AA         0.05
6 C        BB         0.05

Or as a table:

library(tidyverse)
my_tib1 %>% pivot_wider(names_from = feature2, values_from=number)
# A tibble: 3 x 3
  feature1    AA    BB
  <chr>    <dbl> <dbl>
1 A         0.1   0.1 
2 B         0.3   0.4 
3 C         0.05  0.05

Your second table of relative frequencies or joint distribution is:

my_tib2 %>% pivot_wider(names_from = feature2, values_from=number)
# A tibble: 2 x 3
  feature3    AA    BB
  <chr>    <dbl> <dbl>
1 TT         0.1   0.4
2 FF         0.3   0.2

You can calculate the marginal distribution of feature3. As you can see, it sums to 1.

marginals3 = my_tib2 %>% 
  pivot_wider(names_from = feature2, values_from=number) %>% 
  rowwise() %>% 
  mutate(marginals3 = AA+BB) 
> marginals3
# A tibble: 2 x 4
# Rowwise: 
  feature3    AA    BB marginals3
  <chr>    <dbl> <dbl>     <dbl>
1 TT         0.1   0.4       0.5
2 FF         0.3   0.2       0.5

You don't need to pivot to calculate it; just group by feature3:

# mutate() keeps one row per (feature3, feature2) pair, so feature2 stays available for the join
marginals3 = my_tib2 %>% 
  group_by(feature3) %>% 
  mutate(marginals3 = sum(number)) %>% 
  ungroup() %>% 
  select(feature3, feature2, marginals3)

If you keep feature2 in the result, you can join it to my_tib1 and calculate the resulting joint distribution = frequencies in my_tib1 * marginals(feature3):

my_tib1 %>% 
  left_join(marginals3, by = 'feature2') %>% 
  mutate(number.mult = number * marginals3)

If you summarise(sum(number.mult)) you will see the result is 1.
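For reference, here is a self-contained sketch of the whole pipeline with that check at the end. It builds marginals3 with mutate() and select() rather than a multi-row summarise(), which is an assumption made here to keep the code working across dplyr versions; the name joint is also just illustrative. Recent dplyr versions may warn about a many-to-many relationship in the join, which is expected here because every my_tib1 row is meant to match both feature3 levels.

```r
library(dplyr)

my_tib1 <- tibble(
  feature1 = c("A", "A", "B", "B", "C", "C"),
  feature2 = c("AA", "BB", "AA", "BB", "AA", "BB"),
  number   = c(0.1, 0.1, 0.3, 0.4, 0.05, 0.05)
)
my_tib2 <- tibble(
  feature3 = c("TT", "TT", "FF", "FF"),
  feature2 = c("AA", "BB", "AA", "BB"),
  number   = c(0.1, 0.4, 0.3, 0.2)
)

# Marginal of feature3, repeated for each feature2 level so it can be joined
marginals3 <- my_tib2 %>%
  group_by(feature3) %>%
  mutate(marginals3 = sum(number)) %>%
  ungroup() %>%
  select(feature3, feature2, marginals3)

# Joint of (feature1, feature2) times the marginal of feature3
joint <- my_tib1 %>%
  left_join(marginals3, by = "feature2") %>%
  mutate(number.mult = number * marginals3)

nrow(joint)  # 12 rows: 3 x 2 x 2 combinations

# The total is 1, up to floating-point rounding
joint %>% summarise(total = sum(number.mult))
```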

Upvotes: 1
