ella
ella

Reputation: 99

Matrix from dataframe in R/Python

I have a dataframe like this: If the person with that id has that fruit the value is 1, else it is 0. ID column is a primary key.

ID Apple Orange Pear Grapes
E1 1 0 1 1
E2 0 0 1 0
E3 0 1 1 0
E4 1 1 0 0
E5 1 0 0 1

I want a output as a matrix like this. This is a count of people who has both i and j fruit. Where i and j both are same, eg: cell apple x apple, we are looking at the count of all the people who has at least an apple apple. Here, 3 people have apple, so the value is 3. Similarly, we have only one person who has both apple and orange(E4) so the count is 1 in both (orange x apple) and (apple x orange) cells. We have 2 people in the above dataframe who has both grapes and apple(E1 and E5), so value in that cell is 2.

Edits: I want output like this

Apple Orange Pear Grapes
Apple 3 1 1 2
Orange 1 2 1 0
Pear 1 1 3 1
Grapes 2 0 1 2

I am new to R and Python, and not sure how to achieve this. Any help in either R or Python would be really appreciated! Please feel free to ask questions if anything is not clear or anything is missing. Thanks!

Upvotes: 1

Views: 148

Answers (2)

Maurits Evers
Maurits Evers

Reputation: 50678

Here is a tidyverse R option:

library(tidyverse)
data %>%
    pivot_longer(-ID) %>%
    filter(value > 0) %>%
    select(-value) %>%
    group_by(ID) %>%
    nest() %>%
    mutate(data = map(data, ~ expand.grid(.x$name, .x$name))) %>%
    unnest(data) %>%
    group_by(Var1, Var2) %>%
    summarise(n = n(), .groups = "drop") %>%
    filter(Var1 != Var2) %>%
    pivot_wider(names_from = Var1, values_from = n, values_fill = 0)
## A tibble: 4 x 5
#  Var2   Apple  Pear Grapes Orange
#  <fct>  <int> <int>  <int>  <int>
#1 Pear       1     0      1      1
#2 Grapes     2     1      0      0
#3 Orange     1     1      0      0
#4 Apple      0     1      2      1

The idea is to use expand.grid to generate all pairwise combinations. The rest is counting occurrences and reshaping. Re-order rows & columns as necessary.

PS. I should mention that the result is slightly different from the expected output. For example, I don't understand why (Pear, Pear) = 1. I assume this in an error. Since you're asking for co-occurrences, all diagonal elements should be zero.


Sample data

data <- read.table(text = "ID   Apple   Orange  Pear    Grapes
E1  1   0   1   1
E2  0   0   1   0
E3  0   1   1   0
E4  1   1   0   0
E5  1   0   0   1", header = T)

Upvotes: 1

Alex Kosh
Alex Kosh

Reputation: 2554

Here is the example with Python3 and itertools.permutations:

import itertools

import pandas as pd

# Create original dataframe
columns = ['ID', 'Apple', 'Orange', 'Pear', 'Grapes']
rows = [
    ['E1', 1, 0, 1, 1],
    ['E2', 0, 0, 1, 0],
    ['E3', 0, 1, 1, 0],
    ['E4', 1, 1, 0, 0],
    ['E5', 1, 0, 0, 1],
]
df = pd.DataFrame(rows, columns=columns)

# Count values of resulting matrix
def get_ones_indexes(els):
    for i, el in enumerate(els):
        if el == 1:
            yield i

res_n = len(df.columns) - 1
res = [[0] * res_n for _ in range(res_n)]

for _, row in df.drop('ID', axis=1).iterrows():
    indexes = list(get_ones_indexes(row.to_list()))

    if len(indexes) == 1:
        idx = indexes[0]
        res[idx][idx] += 1
    else:
        for i, j in itertools.permutations(indexes, 2):
            res[i][j] += 1

# Convert resultinng matrix to dataframe
_, *fruit_cols = df.columns
res_df = pd.DataFrame(res, index=fruit_cols, columns=fruit_cols)

print(res_df)
#         Apple  Orange  Pear  Grapes
# Apple       0       1     1       2
# Orange      1       0     1       0
# Pear        1       1     1       1
# Grapes      2       0     1       0

Upvotes: 1

Related Questions