Reputation: 4645
Using machine learning, I would like to identify the features that influence net revenue and draw conclusions from the data based on that. The data set comes from a car-sharing company (like Turo) and contains ~80,000 rows and 14 columns.
I am having difficulty building an EDA, especially choosing an ML algorithm to find the features that influence net_revenue.
# What I did so far
I ran a correlation matrix analysis on this data and found that 'youth_driver_fee' is the feature most correlated with 'net_revenue'. (I kept the make and model columns out of the analysis because there are so many makes and models that it is hard to predict their effect on net_revenue.)
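A minimal sketch of that correlation check, assuming the data sits in a pandas DataFrame. The values below are synthetic and the relationship between the columns is invented purely for illustration; only the column names come from the data set described here.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: column names mirror the question, values are made up.
rng = np.random.default_rng(0)
n = 1000
data = pd.DataFrame({
    "trip_length": rng.integers(1, 15, size=n),
    "youth_driver_fee": rng.uniform(0, 50, size=n),
    "lead_time": rng.integers(0, 60, size=n),
})
# Assumed relationship, for illustration only.
data["net_revenue"] = (
    3.0 * data["youth_driver_fee"]
    + 2.0 * data["trip_length"]
    + rng.normal(0, 5, size=n)
)

# Pearson correlation of each column with the target, strongest first.
corr_with_target = (
    data.corr()["net_revenue"]
    .drop("net_revenue")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

Correlation only captures linear, pairwise relationships, so it is a starting point rather than proof that a feature drives net_revenue.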
I wanted to check whether this correlation holds up with ML algorithms such as LogisticRegression and RandomForest. Before applying RandomForest to verify the correlation, I converted the categorical variables (payment_type, returning_guest and returning_host) to dummy variables (0's and 1's).
I then tried to apply these two models by following this post:
LogisticRegression
cols=['driver_age', 'completed_trips', 'vehicle_price', 'lead_time', 'trip_length',
'trip_revenue', 'youth_driver_fee', 'insurance_fee', 'delivery_fee', 'returning_quest_First_time','returning_quest_Repeat','returning_host_First_time','returning_host_repeat']
X=data[cols]
y=data['net_revenue']
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
*default settings of LogisticRegression
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
**The IPython notebook freezes after executing the code above, and it looks like it will never produce any output, so I have to restart the kernel.
RandomForest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
Same problem!
My Questions:
I found one dataset where features are used to predict a target value, but its target looks categorical while mine is continuous: https://www.kaggle.com/prasadkevin/prediction-of-quality-of-wine
To use LogisticRegression and RandomForest, does net_revenue have to be a categorical variable?
Do you happen to know of any similar dataset on Kaggle? I could not find any comparable ML workflow like this one.
Upvotes: -1
Views: 209
Reputation: 467
A few things.
Most machine learning models, including nearly all scikit-learn estimators, need numeric input, so you have to convert every categorical variable to dummy variables for any of them, not just for Random Forests.
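As a rough sketch of that conversion, pandas.get_dummies can encode the categorical columns in one call. The rows below are made up, and the payment_type values ("card", "cash") are hypothetical; only the column names are taken from the question.

```python
import pandas as pd

# Made-up rows; category values such as "card" and "cash" are hypothetical.
data = pd.DataFrame({
    "payment_type": ["card", "cash", "card"],
    "returning_guest": ["First_time", "Repeat", "Repeat"],
    "trip_length": [3, 7, 2],
})

# One 0/1 column per category; numeric columns pass through unchanged.
encoded = pd.get_dummies(
    data, columns=["payment_type", "returning_guest"], dtype=int
)
print(encoded.columns.tolist())
```

For linear models it can help to pass drop_first=True so that the dummy columns of one variable are not perfectly collinear; tree-based models do not need that.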
You are using RandomForestClassifier for a regression problem, which is not what you want. Instead use sklearn.ensemble.RandomForestRegressor.
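A minimal sketch of what the regressor version could look like, including a look at feature_importances_ to see which features the forest relies on. The data is synthetic and the relationship between the columns is an assumption made for illustration; with the real data, X and y would come from the DataFrame in the question. (LinearRegression would be the analogous swap for LogisticRegression, which is also a classifier.)

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: column names mirror the question, values are made up.
rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "trip_length": rng.integers(1, 15, size=n),
    "youth_driver_fee": rng.uniform(0, 50, size=n),
    "lead_time": rng.integers(0, 60, size=n),
})
# Assumed relationship, for illustration only.
y = 3.0 * X["youth_driver_fee"] + 2.0 * X["trip_length"] + rng.normal(0, 5, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A regressor, because the target is continuous.
rf = RandomForestRegressor(n_estimators=100, random_state=0, n_jobs=-1)
rf.fit(X_train, y_train)

# R^2 on held-out data, then impurity-based importances (they sum to 1).
r2 = rf.score(X_test, y_test)
importances = pd.Series(
    rf.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(r2)
print(importances)
```

Impurity-based importances can favor high-cardinality features, so sklearn.inspection.permutation_importance on the test set is a reasonable cross-check.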
Your machine learning models are probably running if no errors are being thrown. Since you have 80,000 rows it may just take a while. When you define your models, define them as
logreg = LogisticRegression(verbose=1)
and
rf = RandomForestRegressor(verbose=1)
If the models are running they will print out their progress so you can see what is going on.
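If it is unclear whether a target suits a classifier or a regressor, one quick check is scikit-learn's type_of_target helper. The arrays below are made-up examples, not values from the question's data.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Made-up targets for illustration.
revenue_like = np.array([10.5, 20.25, 3.75, 99.1])  # real-valued amounts
label_like = np.array([0, 1, 1, 0])                 # two class labels

continuous_kind = type_of_target(revenue_like)
binary_kind = type_of_target(label_like)
print(continuous_kind, binary_kind)
```

A result of "continuous" points to a regressor. A "multiclass" result with thousands of distinct values would mean a classifier has to model one class per value, which can make fitting very slow.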
Upvotes: 1