C8H10N4O2
C8H10N4O2

Reputation: 221

R data.table way to set column values from functions & joins without a for loop?

Is there a way using R data.table to set column values that would require two levels of function calls to set, and is there a way to set column values in a join between two data.tables?

Example: this works but uses a for loop.

library(data.table)

# use something from datasets to illusrate
tAmounts<-data.table(rbind(cbind(ID="Apple", Amt=as.numeric(EuStockMarkets[1:50,2])),
                           cbind(ID="Orange", Amt=as.numeric(EuStockMarkets[1:30,3])),
                           cbind(ID="Lemon", Amt=as.numeric(EuStockMarkets[1:60,4]))))
setkey(tAmounts, ID, Amt)

# is there a data.table way to do this without the for loops?

# summary table with the full hierarchical cluster for each ID
tSummary<-tAmounts[, .(.N, Clust=list()), keyby="ID"]
for (Id in unique(tAmounts$ID)) {
  D<-dist(tAmounts[ID==Id]$Amt)
  C<-hclust(D, method="average")
  tSummary[ID==Id]$Clust<-list(C) # any way to mapply & lapply?
}
# ID      N      Clust
# Apple  50 <hclust[7]>
# Lemon  60 <hclust[7]>
# Orange 30 <hclust[7]>

Maybe there's a way to say something like tSummary[, Clust:=hcust(dist(Amt), method="average"), by="ID") using some combination of lapply and mapply?

Similarly, is there a way to set the column in a join using a function? Continued from example above:

# table of hierarchical cluster cuts, e.g., height of $20, height of $40
tCuts<-CJ(ID=unique(tAmounts$ID), Cut=seq(20,100,20))
setkey(tCuts, ID, Cut)
# ID    Cut
# Apple  20
# Apple  40
# ...etc...

# table with clusters taken at each cut
tClust<-tCuts[tAmounts, on="ID", allow.cartesian=TRUE]
setkey(tClust, ID, Cut, Amt)
# ID Cut    Amt
# Apple  20 1587.4
# Apple  20 1630.6
# ...etc...
# Orange 100 1789.5 

# set ClustNum for each ID, cut, and amount
for (i in 1:nrow(tCuts)) {
  Id<-tCuts[i]$ID
  tClust[ID==Id & Cut==tCuts[i]$Cut, ClustNum:=cutree(tSummary[ID==Id]$Clust[[1]], h=tCuts[i]$Cut)] # any way to mapply in a join?
}

Is there something like tClust[tCuts, ClustNum:=cutree(Clust, h=Cut)] that can join and set the value all at once?

Upvotes: 1

Views: 151

Answers (1)

Tobo
Tobo

Reputation: 754

1: Table of counts and hclust objects, one for each ID

The solution here is just to wrap the RHS of the j-expression inside list().

tSummary <- 
  tAmounts[, .(.N, Clust = list(hclust(dist(Amt), method="average"))), by="ID"]

2: Clusters at heights 20, 40, ..., 100

The ClustNum column of cluster tags can be created from 1 directly (without the cross-joins and intermediate tables). by=ID takes care of the iteration over IDs, while to iterate over heights without looping, we can either use an apply function, or use the fact that cutree() accepts a vector of heights.

cuts <- seq(20,100,20)

# EITHER lapply over heights (and create the Cut column "manually"):
tClust <-
  tSummary[, .(Cut = rep(cuts, each=N),
               ClustNum = unlist(lapply(cuts, function(h) cutree(Clust[[1]], h=h)))),
           by=ID]

# OR pass the vector of heights to cutree():
tClust <-
  tSummary[, melt(as.data.table(cutree(Clust[[1]], h=cuts)),
                  variable.name="Cut", variable.factor=F, value.name="ClustNum"),
           by=ID]

# Add the amounts column
tClust[, Amt := tAmounts[, rep(Amt, times=length(cuts)), by=ID][, ID := NULL]]

Example data (ensuring Amt column numeric):

library(data.table)
tAmounts <- rbind(
      data.frame(ID="Apple", Amt=as.numeric(EuStockMarkets[1:50,2])),
      data.frame(ID="Orange", Amt=as.numeric(EuStockMarkets[1:30,3])),
      data.frame(ID="Lemon", Amt=as.numeric(EuStockMarkets[1:60,4]))
    ) |> setDT()
setkey(tAmounts, ID)

Thanks to @r2evans for help

Upvotes: 3

Related Questions