Reputation: 221
Is there a way using R data.table to set column values that would require two levels of function calls to set, and is there a way to set column values in a join between two data.tables?
Example: this works but uses a for loop.
library(data.table)
# use something from datasets to illusrate
tAmounts<-data.table(rbind(cbind(ID="Apple", Amt=as.numeric(EuStockMarkets[1:50,2])),
cbind(ID="Orange", Amt=as.numeric(EuStockMarkets[1:30,3])),
cbind(ID="Lemon", Amt=as.numeric(EuStockMarkets[1:60,4]))))
setkey(tAmounts, ID, Amt)
# is there a data.table way to do this without the for loops?
# summary table with the full hierarchical cluster for each ID
tSummary<-tAmounts[, .(.N, Clust=list()), keyby="ID"]
for (Id in unique(tAmounts$ID)) {
D<-dist(tAmounts[ID==Id]$Amt)
C<-hclust(D, method="average")
tSummary[ID==Id]$Clust<-list(C) # any way to mapply & lapply?
}
# ID N Clust
# Apple 50 <hclust[7]>
# Lemon 60 <hclust[7]>
# Orange 30 <hclust[7]>
Maybe there's a way to say something like tSummary[, Clust:=hcust(dist(Amt), method="average"), by="ID")
using some combination of lapply and mapply?
Similarly, is there a way to set the column in a join using a function? Continued from example above:
# table of hierarchical cluster cuts, e.g., height of $20, height of $40
tCuts<-CJ(ID=unique(tAmounts$ID), Cut=seq(20,100,20))
setkey(tCuts, ID, Cut)
# ID Cut
# Apple 20
# Apple 40
# ...etc...
# table with clusters taken at each cut
tClust<-tCuts[tAmounts, on="ID", allow.cartesian=TRUE]
setkey(tClust, ID, Cut, Amt)
# ID Cut Amt
# Apple 20 1587.4
# Apple 20 1630.6
# ...etc...
# Orange 100 1789.5
# set ClustNum for each ID, cut, and amount
for (i in 1:nrow(tCuts)) {
Id<-tCuts[i]$ID
tClust[ID==Id & Cut==tCuts[i]$Cut, ClustNum:=cutree(tSummary[ID==Id]$Clust[[1]], h=tCuts[i]$Cut)] # any way to mapply in a join?
}
Is there something like tClust[tCuts, ClustNum:=cutree(Clust, h=Cut)]
that can join and set the value all at once?
Upvotes: 1
Views: 151
Reputation: 754
1: Table of counts and hclust
objects, one for each ID
The solution here is just to wrap the RHS of the j-expression inside list()
.
tSummary <-
tAmounts[, .(.N, Clust = list(hclust(dist(Amt), method="average"))), by="ID"]
2: Clusters at heights 20, 40, ..., 100
The ClustNum
column of cluster tags can be created from 1 directly (without the cross-joins and intermediate tables). by=ID
takes care of the iteration over IDs, while to iterate over heights without looping, we can either use an apply
function, or use the fact that cutree()
accepts a vector of heights.
cuts <- seq(20,100,20)
# EITHER lapply over heights (and create the Cut column "manually"):
tClust <-
tSummary[, .(Cut = rep(cuts, each=N),
ClustNum = unlist(lapply(cuts, function(h) cutree(Clust[[1]], h=h)))),
by=ID]
# OR pass the vector of heights to cutree():
tClust <-
tSummary[, melt(as.data.table(cutree(Clust[[1]], h=cuts)),
variable.name="Cut", variable.factor=F, value.name="ClustNum"),
by=ID]
# Add the amounts column
tClust[, Amt := tAmounts[, rep(Amt, times=length(cuts)), by=ID][, ID := NULL]]
Example data (ensuring Amt
column numeric):
library(data.table)
tAmounts <- rbind(
data.frame(ID="Apple", Amt=as.numeric(EuStockMarkets[1:50,2])),
data.frame(ID="Orange", Amt=as.numeric(EuStockMarkets[1:30,3])),
data.frame(ID="Lemon", Amt=as.numeric(EuStockMarkets[1:60,4]))
) |> setDT()
setkey(tAmounts, ID)
Thanks to @r2evans for help
Upvotes: 3