EngrStudent

Reputation: 2022

equivalent of melt+reshape that splits on column names

Point: if you are going to vote to close, it is poor form not to give a reason why. If it can be improved without requiring a close, take the 10 seconds it takes to write a brief comment.

Question:
How do I do the following "partial melt" in a way that memory can support?

Details:
I have a few million rows and around 1000 columns. The names of the columns have 2 pieces of information in them.

Normally I would melt to a data frame (or data.table) with a pair of columns (variable and value), split the variable name into two new columns, and then cast, using one of the new columns for the output column names and the other to identify the rows.
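For reference, the melt, split, cast workflow described above can be sketched on a tiny table. This is only an illustration under assumptions: the table `toy`, the `row_id` column, and the intermediate names `long`, `grp`, and `field` are hypothetical and do not come from the question's data.

```r
library(data.table)

# Hypothetical toy data: 3 rows, two groups (A, B) with two fields each
toy <- data.table(row_id = 1:3,
                  A001 = c(1, 2, 3),  A002 = c(4, 5, 6),
                  B001 = c(7, 8, 9),  B002 = c(10, 11, 12))

# 1. Fully melt to one (name, value) pair per cell
long <- melt(toy, id.vars = "row_id",
             variable.name = "name", value.name = "value")

# 2. Split the column name into its two pieces of information
long[, grp   := substr(as.character(name), 1, 1)]
long[, field := as.integer(substr(as.character(name), 2, 4))]

# 3. Cast back so that the field number becomes the column name
wide <- dcast(long, row_id + grp ~ field, value.var = "value")
print(wide)
```

The intermediate `long` table holds one row per cell plus the two split columns, which is the step that becomes expensive at scale.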

This isn't working. My billion or so rows of data are making the additional columns overwhelm my memory.
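A rough back-of-envelope estimate shows why the fully melted intermediate is a problem. The dimensions below are assumptions loosely based on the figures in this question, and the per-element sizes are the usual 8 bytes for a double and 4 bytes for an integer or factor code in R; actual usage will differ with temporary copies and character columns.

```r
# Assumed dimensions (hypothetical round numbers, not measured)
n_rows <- 1e6
n_cols <- 1e3

bytes_double <- 8  # numeric values
bytes_int    <- 4  # integer ids and factor codes

# Wide table: one double per cell
wide_gb <- n_rows * n_cols * bytes_double / 1e9

# Fully melted table: value + factor variable + integer row id per cell
long_gb <- n_rows * n_cols * (bytes_double + 2 * bytes_int) / 1e9

# Two extra split columns, assuming both are stored as 4-byte codes
split_gb <- n_rows * n_cols * 2 * bytes_int / 1e9

cat(sprintf("wide: %.0f GB, long: %.0f GB, extra split columns: %.0f GB\n",
            wide_gb, long_gb, split_gb))
```

Under these assumptions the long form with split columns needs about three times the memory of the wide table, before counting any copies made along the way.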

Outside the "iterative force" (as opposed to brute force) of a for-loop, is there a clean and effective way to do this?

Thoughts:

Update (dummy code):

#libraries
library(stringr)

#reproducibility
set.seed(56873504)

#geometry
Ncol <- 2e3
Nrow <- 1e6

#column names
#first 26 groups are prefixed A-Z, any later groups a-z
namelist <- character(length=Ncol)
for(i in 1:(Ncol/200)){
  col_idx <- 1:200+200*(i-1)
  if(i<=26){
    namelist[col_idx] <- paste0(intToUtf8(64+i),str_pad(string=1:200,width=3,pad="0"))
  } else {
    namelist[col_idx] <- paste0(intToUtf8(70+i),str_pad(string=1:200,width=3,pad="0"))
  }
}

#random data
df <- as.data.frame(matrix(runif(n=Nrow*Ncol,min=0, max=16384),nrow=Nrow,ncol=Ncol))
names(df) <- namelist

The output I am looking for would have one column holding the first character of the current name (a single letter) and 200 value columns named 1 to 200. It would be much narrower than "df" but not fully melted, and producing it should not exhaust my CPU or memory.

(Ugly/Manual) Brute force version:

(working on it... )

Upvotes: 2

Views: 422

Answers (1)

Cole

Reputation: 11255

Here are two options, both using data.table.

If you know that each letter prefix always has the same number of fields (200 here, i.e., A001 - A200), you can use melt() and pass it a list of measure variables, with one list element per field position.

melt(dt
     , measure.vars = lapply(seq_len(Ncol_p_grp), seq.int, to = Ncol_p_grp * n_grp, by = Ncol_p_grp)
     , value.name = as.character(seq_len(Ncol_p_grp))
)[, variable := rep(namelist_letters, each = Nrow)][]

#this data set used Ncol_p_grp <- 5 to help condense the data. 
        variable         1          2         3          4          5
     1:        A 0.2655087 0.06471249 0.2106027 0.41530902 0.59303088
     2:        A 0.3721239 0.67661240 0.1147864 0.14097138 0.55288322
     3:        A 0.5728534 0.73537169 0.1453641 0.45750426 0.59670404
     4:        A 0.9082078 0.11129967 0.3099322 0.80301300 0.39263068
     5:        A 0.2016819 0.04665462 0.1502421 0.32111280 0.26037592
    ---                                                              
259996:        Z 0.5215874 0.78318812 0.7857528 0.61409610 0.67813484
259997:        Z 0.6841282 0.99271480 0.7106837 0.82174887 0.92676493
259998:        Z 0.1698301 0.70759513 0.5345685 0.09007727 0.77255570
259999:        Z 0.2190295 0.14661878 0.1041779 0.96782695 0.99447460
260000:        Z 0.4364768 0.06679642 0.6148842 0.91976255 0.08949571
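To see what the list passed to measure.vars contains, here is a small sketch on a hypothetical 2-group, 3-field table. The names `toy`, `measure_idx`, and `res` are illustrative only; the melt() call itself mirrors the one above.

```r
library(data.table)

Ncol_p_grp <- 3   # fields per group
n_grp      <- 2   # groups (A, B)
Nrow       <- 2

# Hypothetical toy data with the same naming pattern as the question
toy <- data.table(A001 = 1:2, A002 = 3:4, A003 = 5:6,
                  B001 = 7:8, B002 = 9:10, B003 = 11:12)
grp_letters <- unique(substr(names(toy), 1, 1))

# One element per field position, holding that field's column index in every group:
# element 1 is columns 1 and 4 (A001, B001), element 2 is 2 and 5, element 3 is 3 and 6
measure_idx <- lapply(seq_len(Ncol_p_grp), seq.int,
                      to = Ncol_p_grp * n_grp, by = Ncol_p_grp)
str(measure_idx)

# Each list element becomes one output column; the groups are stacked in rows
res <- melt(toy,
            measure.vars = measure_idx,
            value.name = as.character(seq_len(Ncol_p_grp)))
res[, variable := rep(grp_letters, each = Nrow)]
print(res)
```

Because the groups are stacked in column order, relabelling `variable` with rep(..., each = Nrow) relies on every group having the same number of fields, in the same order.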

Alternatively, we can use rbindlist(lapply(...)) to go through the data set and subset it based on the letter within the columns.

rbindlist(
  lapply(namelist_letters,
       #anchor the pattern so only the leading letter is matched
       function(x) setnames(
         dt[, grep(paste0('^', x), names(dt), value = TRUE), with = FALSE]
         , as.character(seq_len(Ncol_p_grp)))
  )
  , idcol = 'ID'
, use.names = FALSE)[, ID := rep(namelist_letters, each = Nrow)][]
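A possible variation, sketched here on hypothetical toy data: if the list given to rbindlist() is named, idcol takes its values from those names, so the separate ID relabelling step is not needed. The names `toy`, `grp_letters`, `n_fields`, `pieces`, and `stacked` are illustrative assumptions.

```r
library(data.table)

# Hypothetical toy data: two groups (A, B) with two fields each
toy <- data.table(A001 = 1:2, A002 = 3:4, B001 = 5:6, B002 = 7:8)
grp_letters <- unique(substr(names(toy), 1, 1))
n_fields <- 2

# Subset one group at a time and rename its columns to the field numbers
pieces <- lapply(grp_letters, function(x) {
  cols <- grep(paste0("^", x), names(toy), value = TRUE)
  setnames(toy[, cols, with = FALSE], as.character(seq_len(n_fields)))
})

# Naming the list lets idcol pick up the group letter directly
names(pieces) <- grp_letters
stacked <- rbindlist(pieces, idcol = "ID", use.names = FALSE)
print(stacked)
```

Since each group is subset separately, this form would also allow groups to be processed or written out one at a time if the full result does not fit in memory.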

With 52 million elements in this dataset, the melt option takes roughly 0.14 seconds and the rbindlist option roughly 0.35 seconds. I tried to scale it up tenfold, but I just don't have the RAM to generate that much data in the first place.

#52 million elements - 10,000 rows * 26 grps * 200 cols_per_group
Unit: milliseconds
             expr      min       lq     mean   median       uq      max neval
      melt_option 134.0395 135.5959 137.3480 137.1523 139.0022 140.8521     3
 rbindlist_option 290.2455 323.4414 350.1658 356.6373 380.1260 403.6147     3
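The sketch below shows one way to check that the two options agree and to time them. The wrapper functions `melt_option()` and `rbindlist_option()` are named after the benchmark labels but are my own wrappers, the data is deliberately small, and base system.time() is used instead of whatever benchmarking package produced the table above, so the timings will not be comparable.

```r
library(data.table)

set.seed(1)
Nrow <- 100; Ncol_p_grp <- 4; n_grp <- 3   # small, assumed sizes for a quick check

dt <- data.table(replicate(Ncol_p_grp * n_grp, runif(n = Nrow)))
setnames(dt, paste0(rep(LETTERS[1:n_grp], each = Ncol_p_grp),
                    sprintf("%03d", rep(seq_len(Ncol_p_grp), n_grp))))
namelist_letters <- unique(substr(names(dt), 1, 1))

melt_option <- function() {
  melt(dt,
       measure.vars = lapply(seq_len(Ncol_p_grp), seq.int,
                             to = Ncol_p_grp * n_grp, by = Ncol_p_grp),
       value.name = as.character(seq_len(Ncol_p_grp))
  )[, variable := rep(namelist_letters, each = Nrow)][]
}

rbindlist_option <- function() {
  rbindlist(
    lapply(namelist_letters, function(x) setnames(
      dt[, grep(paste0("^", x), names(dt), value = TRUE), with = FALSE],
      as.character(seq_len(Ncol_p_grp)))),
    idcol = "ID", use.names = FALSE
  )[, ID := rep(namelist_letters, each = Nrow)][]
}

res_melt  <- melt_option()
res_rbind <- rbindlist_option()
setnames(res_rbind, "ID", "variable")

# Both options should produce the same table
same <- isTRUE(all.equal(as.data.frame(res_melt), as.data.frame(res_rbind)))
print(same)

# Rough timing over three runs each
print(system.time(for (i in 1:3) melt_option()))
print(system.time(for (i in 1:3) rbindlist_option()))
```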

Data: Run this before everything above:

#packages ----
library(data.table)
library(stringr)

#data info
Nrow <- 10000
Ncol_p_grp <- 200
n_grp <- 26

#generate data
set.seed(1)
dt <- data.table(replicate(Ncol_p_grp * n_grp, runif(n = Nrow)))

names(dt) <- paste0(rep(LETTERS[1:n_grp], each = Ncol_p_grp)
                    , str_pad(rep(seq_len(Ncol_p_grp), n_grp), width = 3, pad = '0'))

#first letter
namelist_letters <- unique(substr(names(dt), 1, 1))

Upvotes: 1
