Berthrand Eros
Berthrand Eros

Reputation: 105

Exchange data.table columns with most prevalent value of columns

I have data

test = data.table(
  a = c(1,1,3,4,5,6), 
  b = c("a", "be", "a", "c", "d", "c"), 
  c = rep(1, 6)
)

I wish to take the unique values of column a, store it in another data.table, and afterwards fill in the remaining columns with the most prevalent values of those remaining columns, such that my resulting data.table would be:

test2 = data.table(a = c(1,3,4,5,6), b = "a", c = 1)

Column be has equal amounts of "a" and "c", but it doesn't matter which is chosen in those cases.

Attempt so far:

test2 = unique(test, by = "a")
test2[, c("b", "c") := lapply(.SD, FUN = function(x){test2[, .N, by = x][order(-N)][1,1]}), .SDcols = c("b", "c")]

EDIT: I would preferrably like a generic solution that is compatible with a function where I specify the column to be "uniqued", and the rest of the columns are with the single most prevalent value. Hence my use of lapply and .SD =)

EDIT2: as @MichaelChirico points out, how do we keep the class. With the following data.table some of the solutions does not work, although solution of @chinsoon12 does work:

test = data.table(a = c(1,1,3,4,5,6), 
                  b = c("a", "be", "a", "c", "d", "c"), 
                  c = rep(1, 6),
                  d = as.Date("2019-01-01"))

Upvotes: 3

Views: 98

Answers (5)

Armali
Armali

Reputation: 19375

Here's another variant which one might find less sophisticated, yet more readable. It's essentially chinsoon12's rowid approach generalized for any number of columns. Also the classes are kept.

test = data.table(a = c(1,1,3,4,5,6), 
                  b = c("a", "be", "a", "c", "d", "c"), 
                  c = rep(1, 6),
                  d = as.Date("2019-01-01"))
test2 = unique(test, by = "a")
for (col in setdiff(names(test2), "a")) test2[[col]] = test2[[col]][which.max(rowid(test2[[col]]))]

Upvotes: 1

MichaelChirico
MichaelChirico

Reputation: 34703

@EmreKiratli is very close to what I would do:

test[ , c(
  list(a = unique(a)),
  lapply(.SD, function(x) as(tail(names(sort(table(x))), 1L), class(x)))
), .SDcols = !'a']

The as(., class(x)) part is because names in R are always character, so we have to convert back to the original class of x.

You might like this better in magrittr form since it's many nested functions:

library(magrittr)
test[ , c(
  list(a = unique(a)),
  lapply(.SD, function(x) {
    table(x) %>% sort %>% names %>% tail(1L) %>% as(class(x))
  })
), .SDcols = !'a']

Upvotes: 2

chinsoon12
chinsoon12

Reputation: 25225

Another option:

dtmode <- function(x) x[which.max(rowid(x))]
test[, .(A=unique(A), B=dtmode(B), C=dtmode(C))] 

data:

test = data.table(
    A = c(1,1,3,4,5,6), 
    B = c("a", "be", "a", "c", "d", "c"), 
    C = rep(1, 6)
)

Upvotes: 3

Berthrand Eros
Berthrand Eros

Reputation: 105

I was able to make an OK solution, but if somebody can do it more elegantly, for example not going through the step of storting a list in refLevel below, please let me know! I'm very interested in learning data.table properly!

#solution:
test = data.table(a = c(1,1,3,4,5,6), b = c("a", "be", "a", "c", "d", "c"), c = rep(1, 6))
test2 = unique(test, by="a")
funPrev = function(x){unlist(as.data.table(x)[, .N, by=x][order(-N)][1,1], use.names = F)}
refLevel = lapply(test[, c("b", "c")], funPrev)
test2[, c("b", "c") := refLevel]

...and using a function (if anybody see any un-necessary step, please let me know):

genData = function(dt, var_unique, vars_prev){

  data = copy(dt)
  data = unique(data, by = var_unique)

  funPrev = function(x){unlist(as.data.table(x)[, .N, by=x][order(-N)][1,1], use.names = F)}

  refLevel = lapply(dt[, .SD, .SDcols = vars_prev], funPrev)
  data[, (vars_prev) := refLevel] 

  return(data)
}
test2 = genData(test, "a", c("b", "c"))

Upvotes: 1

aekiratli
aekiratli

Reputation: 538

Not a clean way to do this but it works.

test = data.frame(a = c(1,1,3,4,5,6), b = c("a", "be", "a", "c", "d", "c"), c = rep(1, 6))

a = unique(test$a)
b = tail(names(sort(table(test$b))), 1)
c = tail(names(sort(table(test$c))), 1)

test2 = cbind(a,b,c)

Output is like this:

> test2
     a   b   c  
[1,] "1" "c" "1"
[2,] "3" "c" "1"
[3,] "4" "c" "1"
[4,] "5" "c" "1"
[5,] "6" "c" "1"
> 

Upvotes: 2

Related Questions