Reputation: 43
I am a beginner in data science and need help with a topic.
I have a data set about the customers of an institution. My goal is to first find out which customers will pay this institution, and then find out how much money the paying customers will pay.
In this context, I think that I can first find out which customers will pay by applying "classification", and then find out how much they will pay by applying "regression".
So, first I want to apply "classification" and then apply "regression" to this output. How can I do that?
Upvotes: 2
Views: 2725
Reputation: 2410
Sure, you can definitely apply a classification method followed by regression analysis. This is actually a common pattern during exploratory data analysis.
For your use case, based on the basic info you are sharing, I would intuitively go for 1) logistic regression and 2) multiple linear regression.
Logistic regression is actually a classification tool, even though the name suggests otherwise. In a binary logistic regression model, the dependent variable is categorical with two levels, which is what you need in order to predict whether a customer will pay or will not pay (a binary decision).
Multiple linear regression, applied to the same independent variables from your available dataset, will then give you a linear model to predict how much your customers will pay (i.e. the output of the inference will be a continuous variable: the expected dollar value).
That is the approach I would recommend, since you are new to this field. There are obviously many other ways to define these models, depending on the available data, the nature of the data, customer requirements and so on, but the logistic + multiple linear regression approach should be a safe bet to get you going.
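As a rough illustration of the two steps, here is a minimal sketch using scikit-learn. The data is synthetic and the feature matrix `X`, the labels `will_pay`, the target `amount` and the helper `predict_payment` are all made up for the example. It also assumes that the regression is fitted only on the customers who actually paid, which is one reasonable choice rather than the only one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the customer data (hypothetical features).
n = 500
X = rng.normal(size=(n, 3))
# A customer pays when a noisy linear score is positive.
noise = rng.normal(scale=0.5, size=n)
will_pay = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)
# Payers pay an amount that depends on the third feature; non-payers pay 0.
paid = 100 + 20 * X[:, 2] + rng.normal(scale=5, size=n)
amount = np.where(will_pay == 1, paid, 0.0)

# Step 1: logistic regression on all customers (pay vs. not pay).
clf = LogisticRegression().fit(X, will_pay)

# Step 2: linear regression on the paying customers only (assumption).
payers = will_pay == 1
reg = LinearRegression().fit(X[payers], amount[payers])


def predict_payment(X_new):
    """Predicted amount per customer, 0 for those predicted not to pay."""
    predicted = np.zeros(len(X_new))
    mask = clf.predict(X_new) == 1
    if mask.any():
        predicted[mask] = reg.predict(X_new[mask])
    return predicted


predictions = predict_payment(X[:10])
```

On real data you would split into training and test sets and evaluate each step separately (for example accuracy for the classifier and mean absolute error for the regressor).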
Upvotes: 2
Reputation: 124
Of course you can do it, by stacking the models one after the other. Assuming that you are using binary classification, after prediction you will have a dataframe with target values 0 and 1. Filter the rows where target == 1 into a new dataframe, then run the regression on that.
Also, if you don't have labels, you can use clustering rather than classification, since it avoids the cost of labelling the data.
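The filtering step could look roughly like this in pandas. The column names (`customer_id`, `income`, `target`) and the regression coefficients are hypothetical placeholders; in practice the `target` column would come from your classifier and the amounts from a regression model you have already fitted.

```python
import pandas as pd

# Hypothetical classifier output: one row per customer, with the predicted
# label in "target" (1 = will pay, 0 = will not pay).
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "income": [30.0, 45.0, 52.0, 61.0, 75.0, 88.0],
    "target": [0, 1, 1, 0, 1, 1],
})

# Keep only the customers predicted to pay, in a new dataframe.
payers = df[df["target"] == 1].copy()

# Stand-in for a fitted regression: made-up slope and intercept.
slope, intercept = 2.0, 10.0
payers["predicted_amount"] = intercept + slope * payers["income"]

# Merge back so that customers predicted not to pay get an amount of 0.
result = df.merge(
    payers[["customer_id", "predicted_amount"]],
    on="customer_id",
    how="left",
)
result["predicted_amount"] = result["predicted_amount"].fillna(0.0)
```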
Upvotes: 1
Reputation: 4629
Another approach would be to treat it as a pure regression problem, without a cascade of models, which is simpler to handle.
For example, you could assign an amount spent of 0 to the people who are not willing to pay, and fit the model on all instances, including these.
For the business, you could then apply a threshold: if the predicted amount is below that threshold, you classify the user as "not willing to pay".
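A minimal sketch of this single-model idea, on synthetic data and using plain least squares from NumPy. The features, the way the amounts are generated and the `threshold` value are all assumptions for illustration; a real threshold would have to be tuned on held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic customers (hypothetical features); non-payers get amount 0.
n = 400
X = rng.normal(size=(n, 2))
pays = X[:, 0] > 0
paid = 50 + 30 * X[:, 0] + rng.normal(scale=5, size=n)
amount = np.where(pays, paid, 0.0)

# One linear regression on all customers, zeros included.
A = np.column_stack([np.ones(n), X])  # add an intercept column
coef, *_ = np.linalg.lstsq(A, amount, rcond=None)
predicted_amount = A @ coef

# Business rule: below the threshold counts as "not willing to pay".
threshold = 30.0  # made-up value, to be tuned
predicted_payer = predicted_amount >= threshold

# Share of customers whose pay / not pay status is recovered correctly.
accuracy = np.mean(predicted_payer == pays)
```

Keep in mind that a linear model fitted on many zeros can predict negative or shrunken amounts, so it is worth comparing it against the two-step approach on your own data.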
Upvotes: 1