Reputation: 5144
My question is related to assignment by reference versus copying in data.table. I want to know if one can delete rows by reference, similar to
DT[, someCol := NULL]
I want to know about
DT[someRow := NULL, ]
I guess there's a good reason why this function doesn't exist, so maybe you could just point out a good alternative to the usual copying approach, shown below. In particular, going with my favourite from example(data.table),
DT = data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
# x y v
# [1,] a 1 1
# [2,] a 3 2
# [3,] a 6 3
# [4,] b 1 4
# [5,] b 3 5
# [6,] b 6 6
# [7,] c 1 7
# [8,] c 3 8
# [9,] c 6 9
Say I want to delete the first row from this data.table. I know I can do this:
DT <- DT[-1, ]
but often we may want to avoid that, because it copies the object (and that requires about 3*N memory, if N = object.size(DT)), as pointed out here.
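As a small sketch of my own (not from the original post), you can observe that subsetting allocates a brand-new table by checking the object's memory address with data.table's exported address() helper before and after:

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)

# address() reports where the object lives in memory
before <- address(DT)
DT <- DT[-1, ]          # subsetting allocates a brand-new table
after <- address(DT)

identical(before, after)  # FALSE: the whole table was copied
```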
Now I found set(DT, i, j, value)
. I know how to set specific values (for example: set all values in rows 1 and 2, columns 2 and 3, to zero):
set(DT, 1:2, 2:3, 0)
DT
# x y v
# [1,] a 0 0
# [2,] a 0 0
# [3,] a 6 3
# [4,] b 1 4
# [5,] b 3 5
# [6,] b 6 6
# [7,] c 1 7
# [8,] c 3 8
# [9,] c 6 9
But how can I erase the first two rows, say? Doing
set(DT, 1:2, 1:3, NULL)
sets the entire DT to NULL.
My SQL knowledge is very limited, so you guys tell me: given data.table's SQL-inspired syntax, is there an equivalent to the SQL command
DELETE FROM table_name
WHERE some_column=some_value
in data.table?
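(For completeness, my own note rather than part of the original question: the closest idiomatic data.table analogue of that SQL DELETE is a copy-based anti-filter, for example:)

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)

# SQL: DELETE FROM DT WHERE x = 'a'
# data.table: keep the complement; note this copies, it does not delete in place
DT <- DT[x != "a"]
DT[, unique(x)]  # "b" "c"
```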
Upvotes: 176
Views: 61136
Reputation: 127
EDIT: Probably avoid this because it appears to be related to this issue: https://github.com/Rdatatable/data.table/issues/3745
Not sure this is a particularly useful answer (for at least two reasons), but I figured I would add it to the mix. The two reasons this answer is of questionable utility are (1) I would not expect any performance benefit (CPU or RAM) over standard data.table operations (spoilers: because it uses standard operations), and (2) it is not a pure data.table solution, since I use Rcpp as well.
First, we need a simple Rcpp function that replaces the columns of one data.table with another by reference:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export(change_df_by_reference)]]
void change_df_by_reference(DataFrame& DT, const DataFrame& new_DT){
  int n = DT.size();  // number of columns
  for(int i = 0; i < n; ++i){
    DT[i] = new_DT[i];
  }
}
Next, we need to create the R wrapper functions. The below seem to work well enough, but I haven't tested them exhaustively and I doubt they are perfect:
keeprows <- function(DT, ...){
  if(!("data.table" %in% class(DT))){
    stop("Argument 'DT' must be a data.table")
  }
  k <- data.table::key(DT)
  change_df_by_reference(DT, DT[...])
  if(!is.null(k)){
    data.table::setkeyv(DT, k)
  }
  return(NULL)
}
removerows <- function(DT, ...){
  if(!("data.table" %in% class(DT))){
    stop("Argument 'DT' must be a data.table")
  }
  k <- data.table::key(DT)
  indices.to.remove <- DT[, TEMP.INDEX.FOR.REMOVAL := .I][..., -TEMP.INDEX.FOR.REMOVAL]
  DT[, TEMP.INDEX.FOR.REMOVAL := NULL]
  change_df_by_reference(DT, DT[indices.to.remove])
  if(!is.null(k)){
    data.table::setkeyv(DT, k)
  }
  return(NULL)
}
addrows <- function(DT, rows, bottom = TRUE){
  if(!("data.table" %in% class(DT))){
    stop("Argument 'DT' must be a data.table")
  }
  if(!("data.table" %in% class(rows))){
    stop("Argument 'rows' must be a data.table")
  }
  if(!is.logical(bottom)){
    stop("Argument 'bottom' must be a length 1 logical")
  }
  if(length(bottom) != 1 || is.na(bottom)){
    stop("Argument 'bottom' must be a length 1 logical")
  }
  column.names <- colnames(DT)
  rows.column.names <- colnames(rows)
  if(length(column.names) != length(rows.column.names)){
    stop("Columns of arguments 'DT' and 'rows' must match")
  }
  if(!all(column.names %in% rows.column.names) || !all(rows.column.names %in% column.names)){
    stop("Column names of arguments 'DT' and 'rows' must match")
  }
  data.table::setcolorder(rows, column.names)
  column.types <- sapply(DT, class)
  rows.column.types <- sapply(rows, class)
  if(!all(column.types == rows.column.types)){
    stop("Column types of arguments 'DT' and 'rows' must match")
  }
  k <- data.table::key(DT)
  if(bottom){
    change_df_by_reference(DT, data.table::rbindlist(list(DT, rows)))
  } else{
    change_df_by_reference(DT, data.table::rbindlist(list(rows, DT)))
  }
  if(!is.null(k)){
    data.table::setkeyv(DT, k)
  }
  return(NULL)
}
Upvotes: 0
Reputation: 148
This is a version inspired by the versions from vc273 and user7114184. When we want to delete "by reference", we do not want to create a new DT for it. And in fact that is not necessary: if we remove all columns from a data table it becomes a null data table, which allows any number of rows. So instead of shifting the columns into a new data table and continuing with that, we can shift the columns back into the original data table and keep using it.
This gives us two functions: data_table_add_rows, which adds rows "by reference" to a data.table, and data_table_remove_rows, which removes rows "by reference". The first takes a list of values, while the second evaluates a data.table call for filtering, which allows us to do nice things.
#' Add rows to a data table in a memory efficient, by-referencesque manner
#'
#' This mimics the by-reference functionality `DT[, new_col := value]`, but
#' for rows instead. The rows in question are assigned at the end of the data
#' table. If the data table is keyed it is automatically reordered after the
#' operation. If not this function will preserve order of existing rows, but
#' will not preserve sortedness.
#'
#' This function will take the rows to add from a list of columns, or generally
#' anything that can be named and coerced to a data frame. The list may specify
#' fewer columns than are present in the data table; in that case the rest are
#' filled with NA. The list may not specify more columns than are present in
#' the data table. Columns are matched by name if the list is named, or by
#' position if not. The list may not have names not present in the data table.
#'
#' Note that this operation is memory efficient as it will add the rows for
#' one column at a time, only requiring reallocation of single columns at a
#' time. This function will change the original data table by reference.
#'
#' This function will not affect shallow copies of the data table.
#'
#' @param .dt A data table
#' @param value A list (or a data frame). Must have at most as many elements as
#'   there are columns in \code{.dt}. If unnamed, it is applied to the first
#'   columns of \code{.dt}; otherwise it is applied by name. Must not have
#'   names not present in \code{.dt}.
#' @return \code{.dt} (invisibly)
data_table_add_rows <- function(.dt, value) {
  if (length(value) > ncol(.dt)) {
    rlang::abort(glue::glue(
      "Trying to update data table with {ncol(.dt)} columns with {length(value)} columns."))
  }
  if (is.null(names(value))) names(value) <- names(.dt)[seq_len(length(value))]
  value <- as.data.frame(value)
  if (any(!(names(value) %in% names(.dt)))) {
    rlang::abort(glue::glue(
      "Trying to update data table with columns {paste(setdiff(names(value), names(.dt)), collapse = ', ')} not present in original data table."))
  }
  value[setdiff(names(.dt), names(value))] <- NA
  k <- data.table::key(.dt)
  temp_dt <- data.table::data.table()
  for (col in names(.dt)) {
    set(temp_dt, j = col, value = c(.dt[[col]], value[[col]]))
    set(.dt, j = col, value = NULL)
  }
  for (col in names(temp_dt)) {
    set(.dt, j = col, value = temp_dt[[col]])
    set(temp_dt, j = col, value = NULL)
  }
  if (!is.null(k)) data.table::setkeyv(.dt, k)
  .dt
}
#' Remove rows from a data table in a memory efficient, by-referencesque manner
#'
#' This mimics the by-reference functionality `DT[, new_col := NULL]`, but
#' for rows instead. This operation preserves order. If the data table is keyed
#' it will preserve the key.
#'
#' This function will determine the rows to delete by passing all additional
#' arguments to a data.table filter call of the form
#' \code{DT[, .idx := .I][..., j = .idx]}.
#' Thus we can pass a simple index vector or a condition, or even delete by
#' using join syntax, \code{data_table_remove_rows(DT1, DT2, on = cols)} (or,
#' conversely, keep by join using
#' \code{data_table_remove_rows(DT1, !DT2, on = cols)}).
#'
#' Note that this operation is memory efficient as it removes the rows one
#' column at a time, only requiring reallocation of single columns at a
#' time. This function will change the original data table by reference.
#'
#' This function will not affect shallow copies of the data table.
#'
#' @param .dt A data table
#' @param ... Any arguments passed to `[` for filtering the data.table. Must not
#' specify `j`.
#' @return \code{.dt} (invisibly)
data_table_remove_rows <- function(.dt, ...) {
  k <- data.table::key(.dt)
  env <- parent.frame()
  args <- as.list(sys.call()[-1])
  if (!is.null(names(args)) && ".dt" %in% names(args)) args[[".dt"]] <- NULL
  else args <- args[-1]
  if (!is.null(names(args)) && "j" %in% names(args)) {
    rlang::abort("... must not specify j")
  }
  call <- substitute(
    .dt[, .idx := .I][j = .idx],
    env = list(.dt = .dt))
  .nc <- names(call)
  for (i in seq_along(args)) {
    call[[i + 3]] <- args[[i]]
  }
  if (!is.null(names(args))) names(call) <- c(.nc, names(args))
  which <- eval(call, envir = env)
  set(.dt, j = ".idx", value = NULL)
  temp_dt <- data.table::data.table()
  for (col in names(.dt)) {
    set(temp_dt, j = col, value = .dt[[col]][-which])
    set(.dt, j = col, value = NULL)
  }
  for (col in names(temp_dt)) {
    set(.dt, j = col, value = temp_dt[[col]])
    set(temp_dt, j = col, value = NULL)
  }
  if (!is.null(k)) data.table::setattr(.dt, "sorted", k)
  .dt
}
Now this allows us to do quite nice calls. For example we can do:
library(data.table)
d <- data.table(x = 1:10, y = runif(10))
d
#> x y
#> <int> <num>
#> 1: 1 0.77326131
#> 2: 2 0.88699627
#> 3: 3 0.15553784
#> 4: 4 0.71221778
#> 5: 5 0.11964578
#> 6: 6 0.73692709
#> 7: 7 0.05382835
#> 8: 8 0.61129007
#> 9: 9 0.18292229
#> 10: 10 0.22569555
# add some rows (y = NA)
data_table_add_rows(d, list(x=11:13))
# add some rows (y = 0)
data_table_add_rows(d, list(x=14:15, y = 0))
#> x y
#> <int> <num>
#> 1: 1 0.77326131
#> 2: 2 0.88699627
#> 3: 3 0.15553784
#> 4: 4 0.71221778
#> 5: 5 0.11964578
#> 6: 6 0.73692709
#> 7: 7 0.05382835
#> 8: 8 0.61129007
#> 9: 9 0.18292229
#> 10: 10 0.22569555
#> 11: 11 NA
#> 12: 12 NA
#> 13: 13 NA
#> 14: 14 0.00000000
#> 15: 15 0.00000000
# remove all added rows
data_table_remove_rows(d, is.na(y) | y == 0)
#> x y
#> <int> <num>
#> 1: 1 0.77326131
#> 2: 2 0.88699627
#> 3: 3 0.15553784
#> 4: 4 0.71221778
#> 5: 5 0.11964578
#> 6: 6 0.73692709
#> 7: 7 0.05382835
#> 8: 8 0.61129007
#> 9: 9 0.18292229
#> 10: 10 0.22569555
# remove by join
e <- data.table(x = 2:5)
data_table_remove_rows(d, e, on = "x")
#> x y
#> <int> <num>
#> 1: 1 0.77326131
#> 2: 6 0.73692709
#> 3: 7 0.05382835
#> 4: 8 0.61129007
#> 5: 9 0.18292229
#> 6: 10 0.22569555
# add back
data_table_add_rows(d, c(e, list(y = runif(nrow(e)))))
#> x y
#> <int> <num>
#> 1: 1 0.77326131
#> 2: 6 0.73692709
#> 3: 7 0.05382835
#> 4: 8 0.61129007
#> 5: 9 0.18292229
#> 6: 10 0.22569555
#> 7: 2 0.99372144
#> 8: 3 0.03363720
#> 9: 4 0.69880083
#> 10: 5 0.67863547
# keep by join
data_table_remove_rows(d, !e, on = "x")
#> x y
#> <int> <num>
#> 1: 2 0.9937214
#> 2: 3 0.0336372
#> 3: 4 0.6988008
#> 4: 5 0.6786355
EDIT: Thanks to Matt Summersgill for a slightly better performing version of this!
Upvotes: 1
Reputation: 59612
Good question. data.table can't delete rows by reference yet.
data.table can add and delete columns by reference since it over-allocates the vector of column pointers, as you know. The plan is to do something similar for rows and allow fast insert and delete. A row delete would use memmove in C to budge up the items (in each and every column) after the deleted rows. Deleting a row in the middle of the table would still be quite inefficient compared to a row-store database such as SQL, which is more suited to fast insert and delete of rows wherever those rows are in the table. But still, it would be a lot faster than copying a new large object without the deleted rows.
On the other hand, since column vectors would be over-allocated, rows could be inserted (and deleted) at the end, instantly; e.g., a growing time series.
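(A sketch of my own, not from this answer: until such over-allocation of rows is implemented, appending still copies, so a common pattern for a growing table is to collect chunks in a list and bind them once with rbindlist(), paying for one allocation instead of one per append.)

```r
library(data.table)

# Hypothetical growing time series: accumulate chunks, bind once at the end.
# rbindlist() makes a single allocation instead of copying on every append.
chunks <- lapply(1:5, function(i) data.table(t = i, value = i^2))
DT <- rbindlist(chunks)
nrow(DT)  # 5
```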
It's filed as an issue: Delete rows by reference.
Upvotes: 132
Reputation: 1728
Here are some strategies I have used. I believe a .ROW function may be coming. None of the approaches below are fast; they are strategies a little beyond subsetting or filtering. I tried to think like a DBA just trying to clean up data. As noted above, you can select or remove rows in data.table:
data(iris)
iris <- data.table(iris)
iris[3] # Select row three
iris[-3] # Remove row three
You can also use .SD to select or remove rows:
iris[, .SD[3]] # Select row three
iris[, .SD[3:6], by = .(Species)] # Select rows 3 - 6 for each Species
iris[, .SD[-3]] # Remove row three
iris[, .SD[-3:-6], by = .(Species)] # Remove rows 3 - 6 for each Species
Note: .SD creates a subset of the original data and allows you to do quite a bit of work in j or in a subsequent data.table. See https://stackoverflow.com/a/47406952/305675. Here I order my irises by Sepal.Length, take a specified Sepal.Length as the minimum, select the top three (by Sepal.Length) of each Species, and return all accompanying data:
iris[order(-Sepal.Length)][Sepal.Length > 3, .SD[1:3], by = .(Species)]
The approaches above all reorder a data.table sequentially when removing rows. Alternatively, you can transpose a data.table and remove or replace the old rows, which are now transposed columns. When using ':=NULL' to remove a transposed row, the corresponding column name is removed as well:
m_iris <- data.table(t(iris))[,V3:=NULL] # V3 column removed
d_iris <- data.table(t(iris))[,V3:=V2] # V3 column replaced with V2
When you transpose the data.frame back to a data.table, you may want to rename it from the original data.table and restore class attributes in the case of deletion, since transposing coerces all columns to character class.
m_iris <- data.table(t(d_iris));
setnames(d_iris,names(iris))
d_iris <- data.table(t(m_iris));
setnames(m_iris,names(iris))
You may just want to remove duplicate rows which you can do with or without a Key:
d_iris[,Key:=paste0(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species)]
d_iris[!duplicated(Key),]
d_iris[!duplicated(paste0(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species)),]
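(Worth noting, as my own addition: data.table's unique() method covers this directly without building a Key column; its by argument picks which columns define a duplicate.)

```r
library(data.table)

d_iris <- data.table(iris)[c(1, 1, 2, 3)]  # duplicate the first row on purpose

# unique.data.table drops duplicated rows; by = selects the defining columns
unique(d_iris)                  # all columns define a duplicate -> 3 rows
unique(d_iris, by = "Species")  # first row per Species -> 1 row here (all setosa)
```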
It is also possible to add an incremental counter with '.I'. You can then search for duplicated keys or fields and remove them by removing the record with the counter. This is computationally expensive, but has some advantages since you can print the lines to be removed.
d_iris[,I:=.I,] # add a counter field
d_iris[,Key:=paste0(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species)]
for(i in d_iris[duplicated(Key),I]) {print(i)} # See lines with duplicated Key or Field
for(i in d_iris[duplicated(Key),I]) {d_iris <- d_iris[!I == i,]} # Remove lines with duplicated Key or any particular field.
You can also just fill a row with 0s or NAs and then use an i query to delete them:
X
x v foo
1: c 8 4
2: b 7 2
X[1] <- c(0)
X
x v foo
1: 0 0 0
2: b 7 2
X[2] <- c(NA)
X
x v foo
1: 0 0 0
2: NA NA NA
X <- X[x != 0,]
X <- X[!is.na(x),]
Upvotes: 3
Reputation: 3253
The topic still interests many people (me included).
What about the following? I used assign to replace the object in globalenv, combined with the code described previously. It would be better to capture the original environment, but at least in globalenv it is memory efficient and acts like a change by reference.
delete <- function(DT, del.idxs)
{
  varname = deparse(substitute(DT))
  keep.idxs <- setdiff(DT[, .I], del.idxs)
  cols = names(DT)
  DT.subset <- data.table(DT[[1]][keep.idxs])
  setnames(DT.subset, cols[1])
  for (col in cols[2:length(cols)])
  {
    DT.subset[, (col) := DT[[col]][keep.idxs]]
    DT[, (col) := NULL]  # delete
  }
  assign(varname, DT.subset, envir = globalenv())
  return(invisible())
}
DT = data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
delete(DT, 3)
Upvotes: 5
Reputation:
Here is a working function based on @vc273's answer and @Frank's feedback.
delete <- function(DT, del.idxs) {  # pls note 'del.idxs' vs. 'keep.idxs'
  keep.idxs <- setdiff(DT[, .I], del.idxs)  # select row indexes to keep
  cols = names(DT)
  DT.subset <- data.table(DT[[1]][keep.idxs])  # this is the subsetted table
  setnames(DT.subset, cols[1])
  for (col in cols[2:length(cols)]) {
    DT.subset[, (col) := DT[[col]][keep.idxs]]
    DT[, (col) := NULL]  # delete
  }
  return(DT.subset)
}
And example of its usage:
dat <- delete(dat,del.idxs) ## Pls note 'del.idxs' instead of 'keep.idxs'
Where "dat" is a data.table. Removing 14k rows from 1.4M rows takes 0.25 sec on my laptop.
> dim(dat)
[1] 1419393 25
> system.time(dat <- delete(dat,del.idxs))
user system elapsed
0.23 0.02 0.25
> dim(dat)
[1] 1404715 25
>
PS. Since I am new to SO, I could not add comment to @vc273's thread :-(
Upvotes: 7
Reputation: 699
The approach that I have taken to keep memory use similar to in-place deletion is to subset one column at a time and delete it as I go. It is not as fast as a proper C memmove solution, but memory use is all I care about here. Something like this:
DT = data.table(col1 = 1:1e6)
cols = paste0('col', 2:100)
for (col in cols){ DT[, (col) := 1:1e6] }
keep.idxs = sample(1e6, 9e5, FALSE) # keep 90% of entries
DT.subset = data.table(col1 = DT[['col1']][keep.idxs]) # this is the subsetted table
for (col in cols){
  DT.subset[, (col) := DT[[col]][keep.idxs]]
  DT[, (col) := NULL]  # delete
}
Upvotes: 32
Reputation: 263481
Instead of trying to set to NULL, try setting to NA (matching the NA type for the first column):
set(DT, 1:2, 1:3, NA_character_)
Upvotes: 4