Reputation: 3621
I have a data frame (df) or data table (dt) with, let's say, 1000 variables and 1000 observations. I checked that there are no duplicates among the observations, so dt[!duplicated(dt)] has the same number of rows as the original file.
I would like to create an ID variable for all these observations from a combination of some of the 1000 variables I have. Unlike other SO questions, I don't know which variables are most suitable for building the ID, and I will likely need a combination of at least 3 or 4 of them.
Is there any package/function in R that could find the most efficient combination of variables to create an ID variable? In my real example I am struggling to create an ID manually, and it is probably not the best combination of variables.
Example with mtcars:
require(data.table)
example <- data.table(mtcars)
rownames(example) <- NULL # Delete mtcars row names
example <- example[!duplicated(example),]
example[,id_var_wrong := paste0(mpg,"_",cyl)]
length(unique(example$id_var_wrong)) # Wrong ID, there are only 27 different values for this variable despite 32 observations
example[,id_var_good := paste0(wt,"_",qsec)]
length(unique(example$id_var_good)) # Good ID as there are equal number of unique values as different observations.
Is there any function to find wt and qsec automatically rather than manually?
Upvotes: 4
Views: 3225
Reputation: 1011
Based on @F. Privé's answer. You can optionally specify a startVar if you have an idea of what might be best; otherwise the search starts from the variable with the largest number of distinct values.
library(dplyr)
columnsID <- function(dataset,
                      startVar = NULL,
                      frac = 1) {
  # Set up a temporary data frame:
  # remove any complete duplicates with distinct(),
  # take a sample if not working on the full data
  tibb <- as_tibble(dataset) %>%
    distinct() %>%
    sample_frac(frac)
  # Set up the vars which will be used
  if (is.null(startVar)) {
    startVar <- names(which.max(lapply(tibb, n_distinct)))[[1]]
  }
  vars_agg <- c(startVar)
  vars_all <- names(tibb)
  # Filter out any rows which are already uniquely identified
  tibb <- tibb %>%
    group_by_at(vars(vars_agg)) %>%
    filter(n() > 1)
  while (nrow(tibb) > 0) {
    # Keep track of the vars we haven't used yet
    vars_unused <- setdiff(vars_all, vars_agg)
    # Find the variable which has the most distinct values on average
    # within the grouping we have so far
    var_best <- tibb %>%
      group_by(!!!syms(vars_agg)) %>%
      mutate_at(vars(vars_unused), funs(n_distinct(.))) %>%
      ungroup() %>%
      summarise_at(vars(vars_unused), funs(mean)) %>%
      which.max() %>%
      names()
    # Add the 'best' variable to the list
    vars_agg <- c(vars_agg, var_best)
    # Filter out any rows which are now uniquely identified
    tibb <- tibb %>%
      group_by_at(vars(vars_agg)) %>%
      filter(n() > 1)
  }
  vars_agg
}
columnsID(mtcars)
#> [1] "qsec" "mpg"
Created on 2019-04-02 by the reprex package (v0.2.1)
Upvotes: 0
Reputation: 42544
In many cases there is a natural key which uniquely identifies each observation. For instance, the mtcars data set has unique row names.
library(data.table)
data.table(mtcars, keep.rownames = "id")
                   id  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
 1:         Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
 2:     Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
 3:        Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
 4:    Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
 5: Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
 6:           Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
...
If there is no natural key available, I suggest creating an artificial key by simply numbering the rows consecutively and storing the numbers in an additional column:
data.table(mtcars)[, rn := .I][]
     mpg cyl  disp  hp drat    wt  qsec vs am gear carb rn
 1: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4  1
 2: 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4  2
 3: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1  3
 4: 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1  4
 5: 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2  5
 6: 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1  6
...
Anything else might not be worth the effort, or risks attribute values becoming identical, e.g., when they are rounded.
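The rounding risk is easy to demonstrate with the question's wt/qsec key (a small sketch; rounding both variables to whole numbers is just an illustration, not something from the question):

```r
library(data.table)
example <- data.table(mtcars)

# Exact values: wt + qsec uniquely identify all 32 rows (as in the question)
example[, id_exact := paste0(wt, "_", qsec)]
length(unique(example$id_exact))    # 32

# After rounding, formerly distinct value pairs collide and the "key" breaks,
# e.g. Mazda RX4 Wag (2.875, 17.02) and Hornet Sportabout (3.440, 17.02)
# both round to "3_17"
example[, id_rounded := paste0(round(wt), "_", round(qsec))]
length(unique(example$id_rounded))  # fewer than 32
```

A consecutive row number has no such failure mode.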
Upvotes: 1
Reputation: 11728
A homemade algorithm: the principle is to greedily take the variable with the largest number of distinct values, then keep only the rows that are still duplicated, and iterate. This doesn't guarantee the best solution, but it's an easy way to get a rather good solution quickly.
set.seed(1)
mat <- replicate(1000, sample(c(letters, LETTERS), size = 100, replace = TRUE))
library(dplyr)
columnsID <- function(mat) {
  df <- df0 <- as_data_frame(mat)
  vars <- c()
  while (nrow(df) > 0) {
    var_best <- names(which.max(lapply(df, n_distinct)))[[1]]
    vars <- append(vars, var_best)
    df <- group_by_at(df0, vars) %>% filter(n() > 1)
  }
  vars
}
columnsID(mat)
[1] "V68" "V32"
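Since the greedy pass doesn't guarantee a minimal key, an exhaustive search over column combinations can serve as a baseline when the number of columns is small (a base-R sketch; smallest_key is a made-up name, and choose(p, k) grows quickly, so this is infeasible for anything like 1000 variables):

```r
# Find the smallest set of columns whose combined values are unique per row.
# Tries all combinations of size 1, then 2, ..., returning the first key found.
smallest_key <- function(df) {
  cols <- names(df)
  for (k in seq_along(cols)) {
    for (combo in combn(cols, k, simplify = FALSE)) {
      if (anyDuplicated(df[combo]) == 0) return(combo)
    }
  }
  NULL  # only possible if the data contains fully duplicated rows
}

smallest_key(mtcars)  # a 2-column key; no single mtcars column is unique
```

For mtcars this confirms that a 2-variable key exists, matching the size of the greedy result above.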
Upvotes: 2