Kevin Lee
Kevin Lee

Reputation: 351

How to determine Primary Key?

What Iv been taught so far is that to find the primary key, find the combination of columns/variables whose values each match only one observation. So, no one observation in the primary key matches to more than one observation in the data set.

Say you find the primary key consisting of columns A, W, X, Y, Z, because when you group by them, every row in the dataset is unique.

But then it turns out that you could have just grouped A, W, X, and Z.

My problem is illustrated below where both of the column groupings yield distinct values, but it turns out I didnt need to include Salary.

Lahman::Salaries %>% count(playerID, yearID, salary, teamID) %>% filter(n>1)
 A tibble: 0 x 5
 … with 5 variables: playerID <chr>, yearID <int>, salary <int>, teamID <fct>,
   n <int>

Lahman::Salaries %>% count(playerID, yearID, teamID) %>% filter(n>1)
 A tibble: 0 x 4
 … with 4 variables: playerID <chr>, yearID <int>, teamID <fct>, n <int>

How do you know which columns you dont need as you go along checking if the next column, when included, gives you the primary key?

Upvotes: 1

Views: 574

Answers (1)

r2evans
r2evans

Reputation: 160687

Here's a function that makes no pretenses about being efficient with large data.

#' @param x 'data.frame'
#' @param max.columns 'integer', the max number of columns to consider
#'   for a key; with most data, this might be at most 'ncol(x)-1'
#'   (assuming that there is some numeric or "data" field)
#' @param find 'integer', how many column combinations to look for; if
#'   1 (default), then the first combination is returned, no others
#'   attempted; if below 1, all combinations (up through
#'   'max.columns') are considered
#' @param digits 'integer', how to deal with 'numeric' non-'integer'
#'   columns; if 'NA' (default), do not consider said columns; if a
#'   non-negative integer, the number of digits to round the columns
#'   and factorize; note: setting this too low may "find" keys in
#'   non-trivial-precision floating-point numbers
#' @return 'list' of 'character', combinations of columns that unique
#'   identify the rows given
findkeys <- function(x, max.columns = ncol(x) - 1L, find = 1L, digits = NA) {
  isnum <- sapply(x, function(z) is.numeric(z) & !is.integer(z))
  force(max.columns)
  if (is.na(digits) || digits < 0) {
    x <- x[, !isnum, drop = FALSE ]
  } else {
    possfactors <- lapply(x[,isnum], function(z) as.character(z - trunc(z)))
    signif_digits <- sapply(possfactors, function(s) max(nchar(gsub("^0\\.?", "", gsub("(0{5,}|9{5,}).*", "", s)))))
    remove <- signif_digits[ signif_digits > digits ]
    message("#> Removing: ", paste(sQuote(names(remove)), collapse = ","))
    x <- x[, setdiff(colnames(x), names(remove)), drop = FALSE]
  }
  max.columns <- min(c(max.columns, ncol(x)))
  keys <- list()
  if (max.columns < 1) {
    warning("no columns found, no keys")
    return(keys)
  }
  message("#> Columns : ", paste(sQuote(colnames(x)), collapse = ","))
  for (i in seq_len(max.columns)) {
    combs <- combn(names(x), i)
    for (j in seq_len(ncol(combs))) {
      uniq <- unique(x[, combs[,j], drop = FALSE ])
      if (nrow(x) == nrow(uniq)) {
        keys <- c(keys, list(combs[,j]))
        if (find > 0 && length(keys) >= find) return(keys)
      }
    }
  }
  keys
}

Execution:

findkeys(mtcars)
# Warning in findkeys(mtcars) : no columns found, no keys
# list()
findkeys(mtcars, digits=1)
# #> Removing: 'drat','wt','qsec'
# #> Columns : 'mpg','cyl','disp','hp','vs','am','gear','carb'
# list()
findkeys(mtcars, digits=2)
# #> Removing: 'wt'
# #> Columns : 'mpg','cyl','disp','hp','drat','qsec','vs','am','gear','carb'
# [[1]]
# [1] "mpg"  "qsec"
findkeys(tibble::rownames_to_column(mtcars))
# #> Columns : 'rowname'
# [[1]]
# [1] "rowname"
findkeys(Lahman::Salaries)
# #> Columns : 'yearID','teamID','lgID','playerID','salary'
# [[1]]
# [1] "yearID"   "teamID"   "playerID"
findkeys(Lahman::Salaries, find = 0)
# #> Columns : 'yearID','teamID','lgID','playerID','salary'
# [[1]]
# [1] "yearID"   "teamID"   "playerID"
# [[2]]
# [1] "yearID"   "teamID"   "lgID"     "playerID"
# [[3]]
# [1] "yearID"   "teamID"   "playerID" "salary"  

This is imperfect, certainly. For instance, I highly doubt that we would want to use "salary" as a key for Lahman::Salaries.

It could certainly be made smarter, but as heuristics go, this is fast, likely flawed, and good-enough for the moment. When using this, you might consider removing all known-floating-point numbers before this step, and perhaps down-sample large data ... knowing that sampling may present false-positives.

Upvotes: 1

Related Questions