Reputation: 147
I am normalizing only a few columns. I am subsetting the columns by specifying a range of columns using .SDcols
. I don't know how to name the columns that result from the operation. The names need to be the original names.
dtbl <- data.table("col1"=c(1,2,3), "col2"=c(4,5,6), "col3"=c(7,8,9), "coln"=c("A", "B", "C"))
cols <- c("col1", "col2", "col3")
dtbl[, (cols) := lapply(.SD, function(x) ((x-mn)/(mx-mn))), .SDcols = cols]
# mn and mx are some predefined values that I am using to normalise the columns.
I am instead trying the following
dtbl[, **original_names** := lapply(.SD, function(x) ((x-mn)/(mx-mn))), .SDcols = col1:col3]
It is not working obviously. I am not sure how to get back the names of the columns that are being operated upon.
Edit:
I want to avoid typing the columns names. In this example, there are only three names but in my data there are several. Hence, I would like to subset a range of columns instead.
Upvotes: 2
Views: 66
Reputation: 126
If you intend to apply the function to ALL numeric columns in the data table, you can use .SDcols = is.numeric
to identify the column names then use lapply()
like you proposed originally.
library(data.table)
# example
dtbl <- data.table(
col1 = c(1,2,3),
col2 = c(4,5,6),
col3 = c(7,8,9),
coln = letters[1:3]
)
# arbitrary constants
mn <- 1
mx <- 2
# determine all numeric columns
cols <- names(dtbl[, .SD, .SDcols = is.numeric])
# apply function to numeric columns
dtbl[, (cols) := lapply(.SD, function(x) ((x - mn)/(mx - mn))), .SDcols = cols]
Alternatively, you can operate on the data frame in the same way without naming the columns at all.
dtbl <- dtbl[, lapply(.SD, function(x) ((x - mn)/(mx - mn))), .SDcols = is.numeric]
Upvotes: 2
Reputation: 24845
In this situation, you can use set
:
cols = names(dtbl)[grepl("col\\d",names(dtbl))]
for(c in cols) set(
x = dtbl,
j=c,
value=(dtbl[[c]]-min(dtbl[[c]]))/(max(dtbl[[c]])-min(dtbl[[c]]))
)
Another option is to use lapply
as you were, without attempting to assign any names, and cbind to the other columns
f <- function(x) (x-min(x))/(max(x)-min(x))
cbind(
dtbl[, lapply(.SD, f), .SDcols = cols],
dtbl[,.SD, .SDcols = !cols]
)
In both cases, output is as follows:
col1 col2 col3 coln
1: 0.0 0.0 0.0 a
2: 0.5 0.5 0.5 b
3: 1.0 1.0 1.0 c
Input:
dtbl <- data.table(
col1 = c(1,2,3),
col2 = c(4,5,6),
col3 = c(7,8,9),
coln = letters[1:3]
)
There are lots of ways to get the names of the columns. Above, I've shown the use of a regex. However, you could also use
cols = names(dtbl)[1:3]
In .SDcols, you can use patterns()
, like this
.SDcols=patterns("col\\d")
You can even do something like this:
for(c in c(1,2,3)) set(
x = dtbl,
j=names(dtbl)[c],
value=(dtbl[[c]]-min(dtbl[[c]]))/(max(dtbl[[c]])-min(dtbl[[c]]))
)
Upvotes: 2