Reputation: 21
I have a dataframe (df1) and have calculated the deciles for each row using the following:
#create a function to calculate the deciles
decilefun <- function(x) as.integer(cut(x, unique(quantile(x, probs=0:10/10)), include.lowest=TRUE))
# convert df1 to matrix
mat1 <- as.matrix(df1)
#apply the function I created above to calculate deciles
df1_deciles <- apply(mat1, 1, decilefun)
#add the rownames back in
rownames(df1_deciles) <- row.names(df1)
#convert to dataframe
df1_deciles <- as.data.frame(df1_deciles)
str(df1_deciles) # to show what the data looks like
#'data.frame': 157 obs. of 3321 variables:
# $ Variable1 : int 10 10 4 4 5 8 8 8 6 3 ...
# $ Variable2 : int 8 3 9 7 2 8 9 5 8 2 ...
# $ Variable3 : int 8 4 7 7 2 9 10 3 8 3 ...
I have another dataframe (df2) with the same rownames (Variable1, Variable2,etc...) but different number of columns.
I would like to use the same decile cuts which were used for df1 on this second dataframe but I'm not sure how to do it. I am actually not even sure how to determine/export what the cuts where on the original data which resulted on the df1_deciles dataframe I created. What I mean by this is, how do I export an object which tells me what range of values for Variable1 on df1 were assigned to a decile value = 1 or a decile value = 2, and so on.
I do not want to use the 'decilefun' function I created on df2, but instead want to use the variability and range information from df1.
This is my first question on the platform so I hope it is clear and I hope I have provided enough information. I have tried to find answers on the platform but have not found one. I appreciate any help on this.
Upvotes: 2
Views: 518
Reputation: 59355
Using data.table
:
##
# create an artificial dataset with the structure you describe
#
set.seed(1)
df1 <- data.frame(Variable.1=rnorm(1000), variable.2=runif(1000), variable.3=rgamma(1000, scale=10, shape=5))
df1 <- t(df1)
##
#
df2 <- data.frame(Variable.1=rnorm(1000, -1), variable.2=runif(1000), variable.3=rgamma(1000, scale=20, shape=5))
df2 <- t(df2)
##
# you start here
# assumes df1 and df2 have structure described in problem
# data in rows, not columns
#
library(data.table)
df1 <- as.data.table(t(df1)) # transpose: put data in columns
brks <- lapply(df1, quantile, probs=(0:10)/10, labels=FALSE) # list of deciles for each row in df1
df2 <- as.data.table(df2, keep.rownames = TRUE) # keep df2 data in rows: 1000 columns here
result <- df2[ # this does all the work
, .(value= unlist(.SD),
decile=cut(unlist(.SD), breaks=c(-Inf, brks[[rn]], +Inf), labels=c('below', names(brks[[rn]])[2:11], 'above'))
)
, by=.(rn)]
result[, .N, keyby=.(rn, decile)] # validate that result is reasonable
Applying deciles from one dataset to another has the nuance the some values in the new dataset might be outside the range of the original data. The test data here demonstrates this problem. Variable.1
in df2
has values lower than any in df1
, and variable.3
in df2
has values larger than any in df1
.
Upvotes: 1