Liu Yong

Reputation: 535

Why does my random forest perform worse than a decision tree?

This is my first attempt at implementing a random forest, and unfortunately it performs worse than a single decision tree. I have been working on this for a while but cannot figure out where the problem is. Below are some sample runs. (Sorry for posting the complete code.)

Sklearn Decision Tree Classifier 0.714285714286
Sklearn Random Forest Classifier 0.714285714286
My home made Random Forest Classifier 0.628571428571

Sklearn Decision Tree Classifier 0.642857142857
Sklearn Random Forest Classifier 0.814285714286
My home made Random Forest Classifier 0.571428571429

Sklearn Decision Tree Classifier 0.757142857143
Sklearn Random Forest Classifier 0.771428571429
My home made Random Forest Classifier 0.585714285714

I use the sonar dataset (Sonar, Mines vs. Rocks) from the UCI Machine Learning Repository because it has 60 features.

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# section 1: read data, shuffle, change label from string to float
filename = "sonar_all_data.csv"
colnames = ['c'+str(i) for i in range(60)]
colnames.append('type')
df = pd.read_csv(filename, index_col=None, header=None, names=colnames)
df = df.sample(frac=1).reset_index(drop=True)
df['lbl'] = 1.0
df.loc[df['type']=='R', 'lbl'] = 0.0
df.drop('type', axis=1, inplace=True)
df = df.astype(np.float32)  # astype returns a new frame; it has no inplace option
feature_names = ['c' + str(i) for i in range(60)]
label_name = ['lbl']

# section 2: prep train and test data
test_x = df[:70][feature_names].to_numpy()
test_y = df[:70][label_name].to_numpy().ravel()
train_x = df[70:][feature_names].to_numpy()
train_y = df[70:][label_name].to_numpy().ravel()

# section 3: baseline performance of sklearn's decision tree and random forest
clf = DecisionTreeClassifier()
clf.fit(train_x, train_y)
print("Sklearn Decision Tree Classifier", clf.score(test_x, test_y))

rfclf = RandomForestClassifier(n_jobs=2)
rfclf.fit(train_x, train_y)
print("Sklearn Random Forest Classifier", rfclf.score(test_x, test_y))


# section 4: my first practice of random forest
m = 10                      # number of trees in the ensemble
votes = [1/m] * m           # equal voting weight for every tree
num_train = len(train_x)
num_feat = len(train_x[0])

n = int(num_train * 0.6)    # rows sampled for each tree (without replacement)
k = int(np.sqrt(num_feat))  # features sampled for each tree

index_of_train_data = np.arange(num_train)
index_of_train_feat = np.arange(num_feat)

clfs = [DecisionTreeClassifier() for _ in range(m)]
feats = []

for xclf in clfs:
    # draw a fresh random subset of rows and of features for this tree
    np.random.shuffle(index_of_train_data)
    np.random.shuffle(index_of_train_feat)
    row_idx = index_of_train_data[:n]
    feat_idx = index_of_train_feat[:k]
    sub_train_x = train_x[row_idx, :][:, feat_idx]
    sub_train_y = train_y[row_idx]
    xclf.fit(sub_train_x, sub_train_y)
    feats.append(feat_idx)  # remember which features this tree was trained on

pred = np.zeros(test_y.shape)

# accumulate a weighted average of each tree's 0/1 predictions,
# then threshold at 0.5 for the majority vote
for clf, feat, vote in zip(clfs, feats, votes):
    pred += clf.predict(test_x[:, feat]) * vote

pred[pred > 0.5] = 1.0
pred[pred <= 0.5] = 0.0
print("My home made Random Forest Classifier", sum(pred == test_y) / len(test_y))

Upvotes: 1

Views: 3002

Answers (1)

Fringant

Reputation: 555

As chrisckwong821 put it, you are overfitting: if you build a random forest whose trees are too deep, it will fit your training data too closely and predict new (test) data poorly.
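
A quick way to check this is to compare training and test accuracy, then cap the tree depth. Here is a minimal sketch: train_x, train_y, test_x and test_y are the arrays built in your question, and max_depth=5 is only an illustrative value, not a tuned one.

from sklearn.tree import DecisionTreeClassifier

# An unconstrained tree tends to memorize the training set: near-perfect
# training accuracy paired with much lower test accuracy is the classic
# signature of overfitting.
deep = DecisionTreeClassifier()
deep.fit(train_x, train_y)
print("deep tree    train/test:", deep.score(train_x, train_y), deep.score(test_x, test_y))

# Capping the depth regularizes the tree. max_depth=5 is an illustrative
# choice; in practice you would pick it by cross-validation.
shallow = DecisionTreeClassifier(max_depth=5)
shallow.fit(train_x, train_y)
print("shallow tree train/test:", shallow.score(train_x, train_y), shallow.score(test_x, test_y))

The same max_depth (and min_samples_leaf) keywords are accepted by RandomForestClassifier and by the DecisionTreeClassifier instances inside your homemade ensemble.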

Upvotes: 1
