Reputation: 147

How to apply a function to a range of datatable columns without having to specify new column names?

I am normalizing only a few columns. I am subsetting the columns by specifying a range of columns using .SDcols. I don't know how to name the columns that result from the operation. The names need to be the original names.

dtbl <- data.table("col1"=c(1,2,3), "col2"=c(4,5,6), "col3"=c(7,8,9), "coln"=c("A", "B", "C"))
cols <- c("col1", "col2", "col3")
dtbl[, (cols) := lapply(.SD, function(x) ((x-mn)/(mx-mn))), .SDcols = cols]
# mn and mx are some predefined values that I am using to normalise the columns.

I am instead trying the following

dtbl[, **original_names** := lapply(.SD, function(x) ((x-mn)/(mx-mn))), .SDcols = col1:col3]

It is not working obviously. I am not sure how to get back the names of the columns that are being operated upon.

Edit:

I want to avoid typing the columns names. In this example, there are only three names but in my data there are several. Hence, I would like to subset a range of columns instead.

Upvotes: 2

Answers (2)

Richard Layton

Reputation: 126

If you intend to apply the function to ALL numeric columns in the data table, you can use .SDcols = is.numeric to identify the column names then use lapply() like you proposed originally.

library(data.table)

# example
dtbl <- data.table(
  col1 = c(1,2,3),
  col2 = c(4,5,6),
  col3 = c(7,8,9),
  coln = letters[1:3]
)

# arbitrary constants
mn <- 1
mx <- 2

# determine all numeric columns
cols <- names(dtbl[, .SD, .SDcols = is.numeric])

# apply function to numeric columns
dtbl[, (cols) := lapply(.SD, function(x) ((x - mn)/(mx - mn))), .SDcols = cols]

Alternatively, you can operate on the data frame in the same way without naming the columns at all.

dtbl <- dtbl[, lapply(.SD, function(x) ((x - mn)/(mx - mn))), .SDcols = is.numeric]

Upvotes: 2

langtang

Reputation: 24845

In this situation, you can use set:

cols = names(dtbl)[grepl("col\\d",names(dtbl))]
for(c in cols) set(
  x = dtbl,
  j=c,
  value=(dtbl[[c]]-min(dtbl[[c]]))/(max(dtbl[[c]])-min(dtbl[[c]]))
)

Another option is to use lapply as you were, without attempting to assign any names, and cbind to the other columns

f <- function(x) (x-min(x))/(max(x)-min(x))
cbind(
  dtbl[, lapply(.SD, f), .SDcols = cols],
  dtbl[,.SD, .SDcols = !cols]
)

In both cases, output is as follows:

   col1 col2 col3 coln
1:  0.0  0.0  0.0    a
2:  0.5  0.5  0.5    b
3:  1.0  1.0  1.0    c

Input:

dtbl <- data.table(
  col1 = c(1,2,3),
  col2 = c(4,5,6),
  col3 = c(7,8,9),
  coln = letters[1:3]
)

There are lots of ways to get the names of the columns. Above, I've shown the use of a regex. However, you could also use

cols = names(dtbl)[1:3]

In .SDcols, you can use patterns(), like this

.SDcols=patterns("col\\d")

You can even do something like this:

for(c in c(1,2,3)) set(
  x = dtbl,
  j=names(dtbl)[c],
  value=(dtbl[[c]]-min(dtbl[[c]]))/(max(dtbl[[c]])-min(dtbl[[c]]))
)

Upvotes: 2

How to apply a function to a range of datatable columns without having to specify new column names?

Answers (2)

Related Questions