Reputation: 99
I have a dataframe like this: If the person with that id has that fruit the value is 1, else it is 0. ID column is a primary key.
ID | Apple | Orange | Pear | Grapes |
---|---|---|---|---|
E1 | 1 | 0 | 1 | 1 |
E2 | 0 | 0 | 1 | 0 |
E3 | 0 | 1 | 1 | 0 |
E4 | 1 | 1 | 0 | 0 |
E5 | 1 | 0 | 0 | 1 |
I want a output as a matrix like this. This is a count of people who has both i and j fruit. Where i and j both are same, eg: cell apple x apple, we are looking at the count of all the people who has at least an apple apple. Here, 3 people have apple, so the value is 3. Similarly, we have only one person who has both apple and orange(E4) so the count is 1 in both (orange x apple) and (apple x orange) cells. We have 2 people in the above dataframe who has both grapes and apple(E1 and E5), so value in that cell is 2.
Edits: I want output like this
Apple | Orange | Pear | Grapes | |
---|---|---|---|---|
Apple | 3 | 1 | 1 | 2 |
Orange | 1 | 2 | 1 | 0 |
Pear | 1 | 1 | 3 | 1 |
Grapes | 2 | 0 | 1 | 2 |
I am new to R and Python, and not sure how to achieve this. Any help in either R or Python would be really appreciated! Please feel free to ask questions if anything is not clear or anything is missing. Thanks!
Upvotes: 1
Views: 148
Reputation: 50678
Here is a tidyverse
R option:
library(tidyverse)
data %>%
pivot_longer(-ID) %>%
filter(value > 0) %>%
select(-value) %>%
group_by(ID) %>%
nest() %>%
mutate(data = map(data, ~ expand.grid(.x$name, .x$name))) %>%
unnest(data) %>%
group_by(Var1, Var2) %>%
summarise(n = n(), .groups = "drop") %>%
filter(Var1 != Var2) %>%
pivot_wider(names_from = Var1, values_from = n, values_fill = 0)
## A tibble: 4 x 5
# Var2 Apple Pear Grapes Orange
# <fct> <int> <int> <int> <int>
#1 Pear 1 0 1 1
#2 Grapes 2 1 0 0
#3 Orange 1 1 0 0
#4 Apple 0 1 2 1
The idea is to use expand.grid
to generate all pairwise combinations. The rest is counting occurrences and reshaping. Re-order rows & columns as necessary.
PS. I should mention that the result is slightly different from the expected output. For example, I don't understand why (Pear, Pear) = 1. I assume this in an error. Since you're asking for co-occurrences, all diagonal elements should be zero.
data <- read.table(text = "ID Apple Orange Pear Grapes
E1 1 0 1 1
E2 0 0 1 0
E3 0 1 1 0
E4 1 1 0 0
E5 1 0 0 1", header = T)
Upvotes: 1
Reputation: 2554
Here is the example with Python3 and itertools.permutations
:
import itertools
import pandas as pd
# Create original dataframe
columns = ['ID', 'Apple', 'Orange', 'Pear', 'Grapes']
rows = [
['E1', 1, 0, 1, 1],
['E2', 0, 0, 1, 0],
['E3', 0, 1, 1, 0],
['E4', 1, 1, 0, 0],
['E5', 1, 0, 0, 1],
]
df = pd.DataFrame(rows, columns=columns)
# Count values of resulting matrix
def get_ones_indexes(els):
for i, el in enumerate(els):
if el == 1:
yield i
res_n = len(df.columns) - 1
res = [[0] * res_n for _ in range(res_n)]
for _, row in df.drop('ID', axis=1).iterrows():
indexes = list(get_ones_indexes(row.to_list()))
if len(indexes) == 1:
idx = indexes[0]
res[idx][idx] += 1
else:
for i, j in itertools.permutations(indexes, 2):
res[i][j] += 1
# Convert resultinng matrix to dataframe
_, *fruit_cols = df.columns
res_df = pd.DataFrame(res, index=fruit_cols, columns=fruit_cols)
print(res_df)
# Apple Orange Pear Grapes
# Apple 0 1 1 2
# Orange 1 0 1 0
# Pear 1 1 1 1
# Grapes 2 0 1 0
Upvotes: 1