I have data like this:
library(data.table)

set.seed(1)
dt <- data.table(id = c("A", "A", "B", "B", "C", "C"),
                 var1 = 1:6,
                 var2 = rnorm(6))
> dt
   id var1       var2
1:  A    1 -0.6264538
2:  A    2  0.1836433
3:  B    3 -0.8356286
4:  B    4  1.5952808
5:  C    5  0.3295078
6:  C    6 -0.8204684
but with dozens of numeric variables. I'd like to calculate the percentile of each observation for every numeric variable using data.table, while keeping a key identifier (id) intact. In dplyr I could do it like this:
mutate_if(dt, is.numeric, function(x) { ecdf(x)(x) })
  id      var1      var2
1  A 0.1666667 0.5000000
2  A 0.3333333 0.6666667
3  B 0.5000000 0.1666667
4  B 0.6666667 1.0000000
5  C 0.8333333 0.8333333
6  C 1.0000000 0.3333333
I would also be happy with a result that includes the original var1 and var2.
What would be the best way to approach this?
Thanks for help!
You could calculate the ecdf for all numeric columns in a separate data table like this:
# Non-numeric columns (here id) return NULL and are dropped from the result
dt2 <- as.data.table(lapply(dt, function(x) if (is.numeric(x)) ecdf(x)(x)))
Result:
> dt2
        var1      var2
1: 0.1666667 0.5000000
2: 0.3333333 0.6666667
3: 0.5000000 0.1666667
4: 0.6666667 1.0000000
5: 0.8333333 0.8333333
6: 1.0000000 0.3333333
If you want to cbind this result to the original dt, you could change the column names using paste0:
colnames(dt2) = paste0("centile_",colnames(dt2))
Result:
> dt2
   centile_var1 centile_var2
1:    0.1666667    0.5000000
2:    0.3333333    0.6666667
3:    0.5000000    0.1666667
4:    0.6666667    1.0000000
5:    0.8333333    0.8333333
6:    1.0000000    0.3333333
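As a possible alternative, here is a minimal sketch that adds the centile columns to dt directly with data.table's := and .SDcols, so id and the original columns stay in place. It assumes the dt from the question; num_cols and new_cols are illustrative names, not part of either post.

```r
library(data.table)

# Same example data as in the question
set.seed(1)
dt <- data.table(id = c("A", "A", "B", "B", "C", "C"),
                 var1 = 1:6,
                 var2 = rnorm(6))

# Names of the numeric columns (num_cols / new_cols are illustrative names)
num_cols <- names(dt)[vapply(dt, is.numeric, logical(1))]
new_cols <- paste0("centile_", num_cols)

# Add one centile_* column per numeric column, by reference;
# id and the original columns are left untouched
dt[, (new_cols) := lapply(.SD, function(x) ecdf(x)(x)), .SDcols = num_cols]

print(dt)
```

With the seed above, centile_var2 in the first row should come out as 0.5, matching the dplyr output in the question. Note that := modifies dt by reference; work on copy(dt) if the original table has to stay unchanged.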