Reputation: 729
I saw multiple posts to meet my requirement, but some how not able to get the needed result.
I have a data.table with multiple columns. Out of all the columns I want to select few columns for their maximum value and summarize them by the group variable.
Below is how my sample data -
library("data.table")
set.seed(1200)
ID <- seq(1001,1100)
region <- sample(1:10,100,replace = T)
Q21 <- sample(1:5,100,replace = T)
Q22 <- sample(1:15,100,replace = T)
Q24_LOC_1 <- sample(1:8,100,replace = T)
Q24_LOC_2 <- sample(1:8,100,replace = T)
Q24_LOC_3 <- sample(1:8,100,replace = T)
Q24_LOC_4 <- sample(1:8,100,replace = T)
Q21_PAN <- sample(1:5,100,replace = T)
Q22_PAN <- sample(1:15,100,replace = T)
Q24_LOC_1_PAN <- sample(1:8,100,replace = T)
Q24_LOC_2_PAN <- sample(1:8,100,replace = T)
Q24_LOC_3_PAN <- sample(1:8,100,replace = T)
Q24_LOC_4_PAN <- sample(1:8,100,replace = T)
df1 <- as.data.table(data.frame(ID,region,Q21,Q22,Q24_LOC_1,Q24_LOC_2,Q24_LOC_3,Q24_LOC_4,Q21_PAN,Q22_PAN,Q24_LOC_1_PAN,Q24_LOC_2_PAN,Q24_LOC_3_PAN,Q24_LOC_4_PAN))
Now for the above data I want to select 4 columns for their maximum value by region. So the result should have the ID variable, region variable and these 4 variables with 10 rows. 1 row for each region. I tried below code but it creates a column as mycol and puts the value of the 4th element of vector mycol
mycol <- paste("Q24","LOC",seq(1:4),"PAN",sep = "_")
df2 <- df1[,.(mycol = max(mycol)),by=region]
Please suggest where I am going wrong and how I can achieve this.
Upvotes: 2
Views: 2513
Reputation: 887118
If we need to get max
, after grouping by 'region' and specifying the 'mycol' in .SDcols
, loop through the Subset of Data.table (.SD
) and get the max
df1[, lapply(.SD, max), by = region, .SDcols = mycol]
If there are 'region' that have only NA values, the max
will return with a warning as it returns Inf
. For example,
max(c(NA, NA), na.rm = TRUE)
#[1] -Inf
Warning message: In max(c(NA, NA), na.rm = TRUE) : no non-missing arguments to max; returning -Inf
To correct this, we could add an if/else
condition
df1[, lapply(.SD, function(x) if(all(is.na(x))) NA_integer_
else max(x, na.rm = TRUE)), by = region, .SDcols = mycol]
If we also need 'ID' as a paste
d string
df1[, c(list(ID = toString(ID)), lapply(.SD, max)), by = region, .SDcols = mycol]
Upvotes: 5