Anand

Reputation: 37

Cluster Analysis using R for large data sample

I am just starting out with segmenting a customer database in R for an ecommerce retail business, and I am looking for some guidance on the best approach for this exercise. I have searched the topics already posted here and tried the suggestions myself, such as dist() and hclust(). However, I keep running into one issue or another and cannot get past them, since I am new to R. Here is a brief description of my problem. I have approximately 480K records of customers who have made purchases so far. The data contains the following columns:

The business goal of this exercise is to identify the most profitable segments and encourage repeat purchases in those segments using campaigns. Can I please get some guidance on how to do this successfully without running into problems with the sample size or the data types of the columns?

Upvotes: 0

Views: 1428

Answers (2)

knb

Reputation: 9295

Sadly your data does not contain any attributes that indicate what types of items/transactions did NOT result in a sale.

I am not sure if clustering is the way to go here.

Here are some ideas:

First split your data into a training set (say 70%) and a test set.
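For example, a random 70/30 split in base R might look like this (a minimal sketch; it produces the trainset and testset used in the code below):

set.seed(42)   # for reproducibility
train_idx <- sample(nrow(custdata), floor(0.7 * nrow(custdata)))
trainset <- custdata[train_idx, ]
testset <- custdata[-train_idx, ]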

Set up a simple linear regression model with, say, "average basket value" as the response variable and all other variables as independent variables.

fit <- lm(averagebasketvalue ~ ., data = trainset)   # fit on the training set

Run the model on the training set, determine significant attributes (those with at least one star in the summary(fit) output), then focus on those variables.

Check how the model performs on the test set by calculating R-squared and the sum of squared errors (SSE) there. You can use the predict() function; the calls will look like this:

fitpred <- predict(fit, newdata = testset)
SSE <- sum((testset$averagebasketvalue - fitpred)^2)
SST <- sum((testset$averagebasketvalue - mean(testset$averagebasketvalue))^2)
1 - SSE / SST   # R-squared on the test set

Maybe "city" contains too many unique values to be meaningful. Try to generalize them by introducing a new attribute CityClass (e.g. BigCity-MediumCity-SmallCity ... or whatever classification scheme is useful for your cities). You might also condition the model on "gender". Drop "email id".

This can go on for a while... play with the model to try to get better R-squared and SSEs.

I think a tree-based model (rpart) might also work well here.
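A hedged sketch of what that could look like, reusing the trainset/testset names from the split above (you may want to tune the cp parameter when pruning):

library(rpart)
tree_fit <- rpart(averagebasketvalue ~ ., data = trainset, method = "anova")   # regression tree for a numeric response
printcp(tree_fit)   # complexity table, useful for deciding how much to prune
tree_pred <- predict(tree_fit, newdata = testset)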

Then you might change to cluster analysis at a later time.

Upvotes: 0

blakeoft

Reputation: 2400

Read this to learn how to subset data frames. When you try to define d, it looks like you're providing way too much data, which might be fixed by subsetting your table first. If not, you might want to take a random sample of your data instead of using all of it. Suppose you know that columns 4 through 10 of your data frame cust_data contain numerical data; then you might try this:

cust_data2 <- cust_data[, 4:10]
d <- dist(cust_data2)
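If the column subset is still too large for dist() (with roughly 480K rows you would need on the order of 10^11 pairwise distances), here is a sketch of the random-sample route; the sample size of 10,000 is just an illustrative choice:

set.seed(1)
samp_rows <- sample(nrow(cust_data2), 10000)
d <- dist(cust_data2[samp_rows, ])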

For large values, you may want to log transform them. Just experiment and see what makes sense; I am not really sure about this, and it's just a suggestion. Maybe choosing a more appropriate clustering method or distance metric would be better.
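One way to try that, as a sketch (log1p() is log(1 + x), which avoids problems with zero values; whether this helps is something to experiment with):

cust_data2 <- as.data.frame(lapply(cust_data2, log1p))   # log-transform every numeric column before computing d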

Finally, when you run hclust, you need to pass in d (the distance object returned by dist()), not the original data set.

h <- hclust(d, "ave")

Upvotes: 1
