User981636
User981636

Reputation: 3621

Creating a unique ID variable as combination of variables

I have a data frame (df) or data table (dt) with, let’s say 1000 variables and 1000 observations. I checked that there are no duplicates in the observations, so dt[!duplicated(dt)] has the same length as the original file.

I would like to create an ID variable for all this observation with a combination of some of the 1000 variables I have. Differently to other SO questions as I don’t know which variables are more suitable to create the ID and it is likely that I need a combination of, at least, 3 or 4 variables.

Is there any package/function in R that could get me the most efficient combination of variables to create an ID variable? In my real example I am struggling to create an ID manually, and probably it is not the best combination of variables.

Example with mtcars:

require(data.table)
example <- data.table(mtcars)
rownames(example) <- NULL # Delete mtcars row names
example <- example[!duplicated(example),]
example[,id_var_wrong := paste0(mpg,"_",cyl)]
length(unique(example$id_var_wrong)) # Wrong ID, there are only 27 different values for this variable despite 32 observations

example[,id_var_good := paste0(wt,"_",qsec)]
length(unique(example$id_var_good)) # Good ID as there are equal number of unique values as different observations.

Is there any function to find wt and qsec automatically and not manually?

Upvotes: 4

Views: 3225

Answers (3)

Moohan
Moohan

Reputation: 1011

Based on @F. Privé's answer; You can optionally specify a 'startVar' if you have an idea of what might be best, otherwise to start it will just pick the var with the max number of distinct values.

library(dplyr)
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

columnsID <- function(dataset,
                      startVar = NULL,
                      frac = 1) {
  #Set up some temporary dataframes
  #Remove any total duplicates with distinct
  #Take a sample if not working on the full data
  tibb <- as_tibble(dataset) %>%
    distinct() %>%
    sample_frac(frac)

  #Set up the vars which will be used
  if (is.null(startVar)) {
    startVar <- names(which.max(lapply(tibb, n_distinct)))[[1]]
  }
  vars_agg <- c(startVar)
  vars_all <- names(tibb)

  #Filter out any rows which are already uniquely identified
  tibb <- tibb %>% group_by_at(vars(vars_agg)) %>%
    filter(n() > 1)

  while (nrow(tibb) > 0) {
    #Keep track of the vars we haven't used yet
    vars_unused <- setdiff(vars_all, vars_agg)

    #Find the variable which has the most distinct number of values on average
    #for the grouping we have so far
    var_best <-
      tibb %>%
      group_by(!!!syms(vars_agg)) %>%
      mutate_at(vars(vars_unused), funs(n_distinct(.))) %>%
      ungroup() %>%
      summarise_at(vars(vars_unused), funs(mean)) %>%
      which.max() %>%
      names()

    #Add the 'best variable' to the list
    vars_agg <- c(vars_agg, var_best)

    #Filter out any rows which are now uniquely identified
    tibb <- tibb %>%
      group_by_at(vars(vars_agg)) %>%
      filter(n() > 1)
  }
  vars_agg
}

columnsID(mtcars)
#> [1] "qsec" "mpg"

Created on 2019-04-02 by the reprex package (v0.2.1)

Upvotes: 0

Uwe
Uwe

Reputation: 42544

In many cases there is a natural key which uniquely identifies each observation. For instance, the mtcars data set has unique row names.

library(data.table)
data.table(mtcars, keep.rownames = "id")
                     id  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
 1:           Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
 2:       Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
 3:          Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
 4:      Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
 5:   Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
 6:             Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
 ...

If there is no natural key available, I suggest to create an articifial key by simply numbering the rows consecutively and storing it in an additional column:

data.table(mtcars)[, rn := .I][]
     mpg cyl  disp  hp drat    wt  qsec vs am gear carb rn
 1: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4  1
 2: 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4  2
 3: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1  3
 4: 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1  4
 5: 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2  5
 6: 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1  6
 ...

Anything else might not be worth the effort, or is at risk that attribute values may become identical, e.g., when they are rounded.

Upvotes: 1

F. Priv&#233;
F. Priv&#233;

Reputation: 11728

A homemade algorithm: the principle is to greedily take the variable with the most distinct number of elements and then to filter only the remaining rows with duplicates and to iterate. This doesn't give the best solution but it's an easy way to get a rather good solution quickly.

set.seed(1)
mat <- replicate(1000, sample(c(letters, LETTERS), size = 100, replace = TRUE))

library(dplyr)

columnsID <- function(mat) {
  df <- df0 <- as_data_frame(mat)
  vars <- c()
  while(nrow(df) > 0) {
    var_best <- names(which.max(lapply(df, n_distinct)))[[1]]
    vars <- append(vars, var_best)
    df <- group_by_at(df0, vars) %>% filter(n() > 1)
  }
  vars
}

columnsID(mat)
[1] "V68" "V32"

Upvotes: 2

Related Questions