Reputation: 39
I have struggled with this question for a long time, and I have looked extensively on the Internet but never found a solution. Imagine I have the following dataset:
df <- data.frame("Individuals" = c(1,2,3,4,5,6),
"Height" = c(150, 200, 200, 200, 150, 150),
"Weight" = c(100, 50, 50, 100, 50, 100))
This dataset has 6 individuals. For each individual, we measure two attributes: height (either 150 cm or 200 cm) and weight (either 50 kg or 100 kg). I want to create a categorical variable that groups together individuals whose height and weight are both equal. In this case, the variable would look like this:
output_df <- data.frame("Individuals" = c(1,2,3,4,5,6),
"Height" = c(150, 200, 200, 200, 150, 150),
"Weight" = c(100, 50, 50, 100, 50, 100),
"Groups of individuals" = c(1, 2, 2, 3, 4, 1))
There are four groups of individuals with equal values in both variables. In group 1, all have height = 150 and weight = 100; in group 2, all have height = 200 and weight = 50; in group 3, all have height = 200 and weight = 100 (there is only one individual in this group, but it still counts as a separate "group of individuals" insofar as it has a different combination of values than every other group); and in group 4, all have height = 150 and weight = 50 (as with group 3, there is only one individual in this group).
In this case, it is easy to make this classification manually and thus create the variable "Group of individuals". Now imagine I have more variables beyond height and weight, and I want to create the variable "Group of individuals" without knowing in advance the possible values that height and weight (and other variables, if they exist) take. So I want to create a new variable whose value depends on which group of observations a given observation belongs to. The groups are defined by equality conditions; i.e., an observation belongs to a given group if its values across several variables are exactly equal to those of the other observations in that group.
I am finding it extremely difficult to write down the condition that defines this new variable in a generalized manner. The number of values this variable takes is not known a priori (it depends on the specific set of individuals you have). It has a theoretical minimum of 1 (all observations have equal values for all variables) and a theoretical maximum equal to the number of observations (no two observations share the same combination of values, so every individual forms its own group). In my application, I want to create this variable for different datasets, so it will have a different number of values for each dataset.
My best attempts have involved the use of group_by() and case_when() within the tidyverse. I assume there has to be a way to express this as an if_else statement or some other type of conditional statement. Another intuition is that creating this variable might entail some kind of pivoting, creating the variable, and then pivoting back again (also within the tidyverse: https://tidyr.tidyverse.org/articles/pivot.html ). I think the reason the idea is challenging to me is that the variable takes, for each observation, a value defined by equality conditions across observations rather than across variables, which gets me very confused. This is why I guess it might be done with pivoting: one might be able to restate the problem as creating a variable as a function of other variables first, and then return to a dataset in which this variable is a function of equality across observations.
I really hope the formulation of the question is not too confusing. I find the issue so confusing myself that it is also difficult to express. I guess that if I could express it better, I might be able to solve it.
Thank you so much!
Upvotes: 0
Views: 779
Reputation: 389215
One way would be to create a unique key combining the Height and Weight values, and then use match() and unique() to get the group number.
key <- with(df, paste(Height, Weight, sep = '-'))
df$group <- match(key, unique(key))
df
# Individuals Height Weight group
#1 1 150 100 1
#2 2 200 50 2
#3 3 200 50 2
#4 4 200 100 3
#5 5 150 50 4
#6 6 150 100 1
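Since the question mentions having more variables than Height and Weight, here is a minimal sketch of the same match()/unique() idea for an arbitrary set of columns. The name group_cols and the choice of "\r" as separator are my own assumptions for illustration; the separator only needs to be a string that never occurs inside the values, so that two different combinations cannot paste to the same key.

```r
# Same example data as in the question
df <- data.frame(Individuals = 1:6,
                 Height = c(150, 200, 200, 200, 150, 150),
                 Weight = c(100, 50, 50, 100, 50, 100))

# Hypothetical: treat every column except the identifier as a grouping column
group_cols <- setdiff(names(df), "Individuals")

# Paste the chosen columns row by row into one key per observation.
# "\r" is an assumed separator that should not appear in the data.
key <- do.call(paste, c(df[group_cols], sep = "\r"))

# Number the groups in order of first appearance
df$group <- match(key, unique(key))
df$group
```

Note that the key compares values as text, so for non-integer numeric columns you may want to round first; whether that matters depends on your data.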
If the order of the groups is not important and you only care that people with the same height and weight get the same group number, we can also use cur_group_id() from dplyr.
library(dplyr)
df <- df %>% group_by(Height, Weight) %>% mutate(group = cur_group_id())
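The dplyr version can be generalized the same way. This is only a sketch, assuming dplyr 1.0.0 or later; group_cols is again a hypothetical name for the character vector of columns that define a group. The group ids follow dplyr's ordering of the group keys rather than the order of first appearance, so the labels can differ from the match() version even though the grouping itself is the same.

```r
library(dplyr)

# Same example data as in the question
df <- data.frame(Individuals = 1:6,
                 Height = c(150, 200, 200, 200, 150, 150),
                 Weight = c(100, 50, 50, 100, 50, 100))

# Hypothetical: columns whose combined values define a group
group_cols <- c("Height", "Weight")

out <- df %>%
  group_by(across(all_of(group_cols))) %>%  # group by columns named in a vector
  mutate(group = cur_group_id()) %>%        # one integer id per distinct combination
  ungroup()                                 # drop grouping so later steps are not grouped

out
```

The ungroup() at the end is a design choice: without it the result stays grouped by Height and Weight, which can surprise you in later summarise() or mutate() calls.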
Upvotes: 1