Reputation: 127
I am performing a cluster analysis on data from the following site.
https://www.kaggle.com/arjunbhasin2013/ccdata/version/1#
I have segmented the dataset into seven clusters using the following code.
library(cluster)
library(dplyr)
CC_data <- read.csv("CC_GENERAL.csv")
DistMatrix <- dist(CC_data[2:17])
Ward_CCD <- hclust(DistMatrix, method = "ward.D2")
CCD_hclust_cut <- cutree(tree = Ward_CCD, k = 7)
CC_data <- mutate(CC_data, cluster = CCD_hclust_cut)
# Subset the data into individual clusters for further analysis
for (C in 1:7) {
  assign(paste0("cluster", C), filter(CC_data, cluster == C))
}
Now I want to subset each cluster and generate boxplots to summarise the data. The problem is that some of the columns have been scaled to [0,1], the rest are in absolute dollar values, and one column (PRC_FULL_PAYMENT) is a percentage that needs to be rescaled.
I want to create two sets of boxplots for each cluster, using a loop to change which cluster the code refers to. Doing things manually, the code I have is:
C1_frequency <- data.frame(
  cluster1$BALANCE_FREQUENCY,
  cluster1$PURCHASES_FREQUENCY,
  cluster1$ONEOFF_PURCHASES_FREQUENCY,
  cluster1$PURCHASES_INSTALLMENTS_FREQUENCY,
  cluster1$CASH_ADVANCE_FREQUENCY,
  cluster1$PRC_FULL_PAYMENT / 100
)
C1_unscaled <- data.frame(
  cluster1$BALANCE,
  cluster1$PURCHASES,
  cluster1$ONEOFF_PURCHASES,
  cluster1$INSTALLMENTS_PURCHASES,
  cluster1$CASH_ADVANCE,
  cluster1$CASH_ADVANCE_TRX,
  cluster1$PURCHASES_TRX,
  cluster1$CREDIT_LIMIT,
  cluster1$PAYMENTS,
  cluster1$MINIMUM_PAYMENTS
)
This works OK, but I want to avoid the needless repetition by using some sort of loop. I've been trying to use various combinations of the assign() and paste0() functions, as well as one attempt at using [[]] which I still don't really understand, but I keep getting different errors each time I try something.
How can I change the cluster number for 1:7 without doing a copy and paste job?
Upvotes: 1
Views: 64
Reputation: 3090
Someone can probably provide a more elegant answer, but here's a quick'n'dirty solution:
library(dplyr)
for (i in 1:7) {
  # Frequency-type columns, with PRC_FULL_PAYMENT rescaled to match
  assign(paste0("C", i, "_frequency"), {
    get(paste0("cluster", i)) %>%
      mutate(PRC_FULL_PAYMENT_SCALED = PRC_FULL_PAYMENT / 100) %>%
      select(BALANCE_FREQUENCY, PURCHASES_FREQUENCY, ONEOFF_PURCHASES_FREQUENCY,
             PURCHASES_INSTALLMENTS_FREQUENCY, CASH_ADVANCE_FREQUENCY,
             PRC_FULL_PAYMENT_SCALED)
  })
  # Dollar amounts and transaction counts, left as they are
  assign(paste0("C", i, "_unscaled"), {
    get(paste0("cluster", i)) %>%
      select(BALANCE, PURCHASES, ONEOFF_PURCHASES, INSTALLMENTS_PURCHASES,
             CASH_ADVANCE, CASH_ADVANCE_TRX, PURCHASES_TRX, CREDIT_LIMIT,
             PAYMENTS, MINIMUM_PAYMENTS)
  })
}
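If you would rather not create numbered variables at all, the same result can be kept in lists. The following is only a sketch in base R: `toy` is a made-up stand-in for `CC_data` with just three of the real columns, and `freq_cols`, `unscaled_cols`, `frequency` and `unscaled` are names I have invented for illustration. It follows the question's premise that PRC_FULL_PAYMENT needs dividing by 100.

```r
# Toy stand-in for CC_data: random values, three columns plus a cluster label
set.seed(1)
toy <- data.frame(
  BALANCE             = runif(12, 0, 5000),
  PURCHASES_FREQUENCY = runif(12),
  PRC_FULL_PAYMENT    = runif(12, 0, 100),
  cluster             = rep(1:3, each = 4)
)

# Column groups (extend these with the full lists from the question)
freq_cols     <- c("PURCHASES_FREQUENCY", "PRC_FULL_PAYMENT")
unscaled_cols <- c("BALANCE")

# split() returns a list with one data frame per cluster, named "1", "2", "3"
clusters <- split(toy, toy$cluster)

# Build the two subsets for every cluster in one pass each
frequency <- lapply(clusters, function(d) {
  out <- d[freq_cols]
  out$PRC_FULL_PAYMENT <- out$PRC_FULL_PAYMENT / 100  # rescale as in the question
  out
})
unscaled <- lapply(clusters, function(d) d[unscaled_cols])

# [[ ]] extracts a single element of a list by name or by position
head(frequency[["2"]])
```

Here `frequency[["2"]]` plays the role of `C2_frequency`, and looping over `names(frequency)` replaces the `paste0()`/`get()` lookups.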
Upvotes: 3
Reputation: 389175
Maybe you could try creating a function:
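create_subset <- function(df) {
  # Use `=` rather than `<-` inside list() so the two elements are named
  list(
    frequency = data.frame(
      BALANCE_FREQUENCY = df$BALANCE_FREQUENCY,
      PURCHASES_FREQUENCY = df$PURCHASES_FREQUENCY,
      ONEOFF_PURCHASES_FREQUENCY = df$ONEOFF_PURCHASES_FREQUENCY,
      PURCHASES_INSTALLMENTS_FREQUENCY = df$PURCHASES_INSTALLMENTS_FREQUENCY,
      CASH_ADVANCE_FREQUENCY = df$CASH_ADVANCE_FREQUENCY,
      PRC_FULL_PAYMENT = df$PRC_FULL_PAYMENT / 100
    ),
    unscaled = data.frame(
      BALANCE = df$BALANCE,
      PURCHASES = df$PURCHASES,
      ONEOFF_PURCHASES = df$ONEOFF_PURCHASES,
      INSTALLMENTS_PURCHASES = df$INSTALLMENTS_PURCHASES,
      CASH_ADVANCE = df$CASH_ADVANCE,
      CASH_ADVANCE_TRX = df$CASH_ADVANCE_TRX,
      PURCHASES_TRX = df$PURCHASES_TRX,
      CREDIT_LIMIT = df$CREDIT_LIMIT,
      PAYMENTS = df$PAYMENTS,
      MINIMUM_PAYMENTS = df$MINIMUM_PAYMENTS
    )
  )
}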
and then use lapply to apply it to all clusters:
lapply(mget(paste0("cluster", 1:7)), create_subset)
You could also include any other code that you want to apply to each cluster (like boxplot etc.) in the same function create_subset.
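As a rough sketch of that last point, the function below builds both subsets and draws the two boxplots in one go. It runs on made-up toy data with only four of the real columns, and `plot_cluster`, `toy` and `result` are hypothetical names used for illustration, so treat it as a starting point rather than a finished solution.

```r
# Toy stand-in for CC_data: random values, four columns plus a cluster label
set.seed(42)
toy <- data.frame(
  BALANCE             = runif(10, 0, 5000),
  PAYMENTS            = runif(10, 0, 3000),
  PURCHASES_FREQUENCY = runif(10),
  PRC_FULL_PAYMENT    = runif(10, 0, 100),
  cluster             = rep(1:2, each = 5)
)
clusters <- split(toy, toy$cluster)

# Build both subsets for one cluster and draw a boxplot of each
plot_cluster <- function(df, label) {
  freq <- data.frame(
    PURCHASES_FREQUENCY = df$PURCHASES_FREQUENCY,
    PRC_FULL_PAYMENT    = df$PRC_FULL_PAYMENT / 100  # rescale as in the question
  )
  unscaled <- data.frame(
    BALANCE  = df$BALANCE,
    PAYMENTS = df$PAYMENTS
  )
  # boxplot() on a data frame draws one box per column
  boxplot(freq, main = paste("Cluster", label, "frequency"))
  boxplot(unscaled, main = paste("Cluster", label, "unscaled"))
  # Return the subsets invisibly so they can be reused later
  invisible(list(frequency = freq, unscaled = unscaled))
}

# Map() passes each cluster together with its name to the function
result <- Map(plot_cluster, clusters, names(clusters))
```

Afterwards `result[["1"]]$frequency` holds the frequency subset for cluster 1, so the plots and the data come out of the same call.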
Upvotes: 1