Reputation: 71

Concatenating hexbin objects (or other methods to iteratively construct hexbin plots with extremely large datasets)

I have a dataset with 12,000 participants. Each participant is associated with two 360x360 matrices (one matrix representing distance between two brain regions, and another showing a correlation representing the connectedness of the two brain regions over time). I want to plot distance x correlation for all regions, combining all participants, in one plot. That's 1,555,200,000 data points, and R won't assign a vector for even one axis of that size. Ideally I'd also like to plot a best fit line over it.

The strategy I'm attempting is to calculate the number of points in a series of hex bins for each participant using the hexbin package, and then sum them iteratively so that I have counts for each bin across all participants. However, I cannot figure out how to do the sum operation, as there is no concatenate method in the hexbin package (or equivalent functionality anywhere else I've looked).

So basically I want to do something like this:

library(mc2d)
library(hexbin)
N=10000
x1<-rpert(N,0,2,4,shape=5)
y1<-rpert(N,2,8,10,shape=5)
x2<-rpert(N,6,8,10,shape=5)
y2<-rpert(N,0,2,8,shape=5)
xc<-c(x1,x2)
yc<-c(y1,y2)

h1<-hexbin(x1,y1,xbnds=c(0,10),ybnds=c(0,10),xbins=100,shape=.75)
h2<-hexbin(x2,y2,xbnds=c(0,10),ybnds=c(0,10),xbins=100,shape=.75)
hc<-hexbin(xc,yc,xbnds=c(0,10),ybnds=c(0,10),xbins=100,shape=.75)

plot(hc)

Except that I want to generate hc from h1 and h2, instead of from the component vectors (because those would be too large to hold in memory for my actual application). I'm open to using python or another language if it would get the job done.

Upvotes: 2

Answers (2)

mrpeverill

Reputation: 71

Ben's answer is very close, but xcm and ycm in the hexbin object are the center of mass and are not unique to the cell, so merging on them produces duplicates incorrectly if the data overlaps at all. A key piece of information is that the cell IDs are unique to a specific hex in the grid defined by the bounding information (you can find this out by comparing overlapping ranges of cell ids in hexbins that do and do not overlap -- or by looking at x and y coordinates output by the hcell2xy function). So, as long as the bounds of the two hexbins are identical, you can simply merge on cellID.

Problem restatement with overlapping data:

N=10000
x1<-rpert(N,0,2,4,shape=5)
y1<-rpert(N,2,8,10,shape=5)
x2<-rpert(N,0,5,10,shape=5)
y2<-rpert(N,0,5,10,shape=5)


h1<-hexbin(x1,y1,xbnds=c(0,10),ybnds=c(0,10),xbins=100,shape=.75)
h2<-hexbin(x2,y2,xbnds=c(0,10),ybnds=c(0,10),xbins=100,shape=.75)

Solution (adapted from Ben's):

# Get elements from s4 object by name
get_slots <- function(x,nm) Map(\(c) getElement(x, c), nm)

# Unpack hexbin data to be merged in to a dataframe
# Strictly speaking we don't need the xy coordinates, but it is a good error
# check if we have the computation time available.
unpack_hexbin <- function(x) {
  cols <- c("cell", "count", "xcm", "ycm")
  return(cbind(data.frame(get_slots(x,cols)),
                 hcell2xy(x)))
}

# Get columns from a dataframe that should not vary between hexbins to be 
# merged.
getmeta_hexbin <- function(x) {
  varying=c("cell", "count", "xcm", "ycm", "call", "n", "ncells")
  other_slots <- setdiff(slotNames(x), varying)
  get_slots(x,other_slots)
}

# Center of mass calculation for two points, robust to missing data. 
cm<-function(x1,x2,x1w,x2w) {
  i<-x1*x1w
  j<-x2*x2w
  w<-sum(x1w,x2w,na.rm=TRUE)
  return(sum(i,j,na.rm=TRUE)/w)
}

combine_hexbin <- function(a,b) {
    hm <- merge(unpack_hexbin(a), 
                unpack_hexbin(b), 
                by = c("cell","x","y"), 
                all = TRUE)
    if(any(duplicated(hm$cell))) stop("Duplicate cell Id's detected: Do the hexbin objects have the same grid?")
    hm2 <- hm %>% rowwise() %>% mutate(
      count=sum(count.x,count.y,na.rm=TRUE),
      xcm=cm(xcm.x,xcm.y,count.x,count.y),
      ycm=cm(ycm.x,ycm.y,count.x,count.y)
    )
    do.call(new,
            c(list("hexbin"),
              as.list(hm2[,c("cell",
                             "count",
                             "xcm",
                             "ycm")]),
              list(n = sum(hm2$count),
                   ncells = length(hm2)),
              getmeta_hexbin(a),
              call = quote(call("merged hexbin", 1))
            ))
}

plot(combine_hexbin(h1,h2))

Upvotes: 0

Ben Bolker

Reputation: 226182

I think this works:

unpack hexbin object

extract either values (as columns of a data frame) or metadata

unpack_hexbin <- function(x, element = c("cols", "metadata")) {
    element <- match.arg(element)
    get_slots <- function(nm) Map(\(c) getElement(x, c), nm)
    cols <- c("cell", "count", "xcm", "ycm")
    if (element == "cols") return(get_slots(cols) |>
                                  do.call(what = "data.frame"))
    other_slots <- setdiff(slotNames(x), c(cols, "call", "n", "ncells"))
    get_slots(other_slots)
}

combine

Take a list of hexbin objects.

Extract and merge the data columns
Sum the values associated with each cell, skipping NAs
Extract the metadata from the first hexbin object (assuming they are all consistent!)
Put the pieces back together

combine_hexbin <- function(L) {
    h <- Map(unpack_hexbin, L)
    h2 <- Reduce(\(x,y) 
        merge(x, y, by = c("cell", "xcm", "ycm"), all = TRUE),
           h)
    comb <- apply(h2[,-(1:3)], 1, sum, na.rm = TRUE)
    do.call(new,
            c(list("hexbin"),
              as.list(h2[,1:3]),
              list(count = comb,
                   n = sum(comb),
                   ncells = length(comb)),
              unpack_hexbin(L[[1]], "metadata"),
              call = quote(call("junk", 1))
            ))
}

try it out

With values from example above:

L <- list(h1, h2)
cc <- combine_hexbin(L)
plot(cc)