Reputation: 717
I am using scikit-learn for a binary classification task, and I have:
Class 0: 200 observations
Class 1: 50 observations
Because the data is unbalanced, I want to take a random subsample of the majority class with the same number of observations as the minority class, and use the newly obtained dataset as input to the classifier. The process of subsampling and classifying can be repeated many times. I have the following code for the subsampling, written mainly with the help of Ami Tavory:
import numpy as np
from sklearn.datasets import load_files

docs_train = load_files(rootdir, categories=categories, encoding='latin-1')
X_train = np.array(docs_train.data)  # documents as a NumPy array so boolean indexing works
y_train = docs_train.target

# split into majority and minority classes (assuming class 0 is the majority class)
majority_x, majority_y = X_train[y_train == 0], y_train[y_train == 0]
minority_x, minority_y = X_train[y_train == 1], y_train[y_train == 1]

# draw 50 random majority indices (one per minority sample)
inds = np.random.choice(majority_x.shape[0], 50)
majority_x = majority_x[inds]
majority_y = majority_y[inds]
This works like a charm. However, after processing majority_x and majority_y, I want to replace the old set that represents class 0 in X_train and y_train with the new, smaller set, so I can pass it to the classifier or the pipeline as follows:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

pipeline = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=tokens, binary=True)),
    ('classifier', SVC(C=1, kernel='linear'))])
pipeline.fit(X_train, y_train)
What I have done in order to solve this: since the resulting arrays were NumPy arrays, and because I am new to this whole area and really trying hard to learn, I tried to combine the two arrays with majority_x + minority_x to form the training data I want. That failed with errors I am still trying to resolve. But even if I could combine them, how do I keep the indices aligned so that majority_y and minority_y remain correct as well?
Upvotes: 4
Views: 1828
Reputation: 1480
After processing majority_x and majority_y you can merge your training sets with:
X_train = np.concatenate((majority_x, minority_x))
y_train = np.concatenate((majority_y, minority_y))
Now X_train and y_train will first contain the chosen samples with y=0 and then the samples with y=1.
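If you want to repeat the whole subsample-and-fit cycle several times, a minimal sketch of the loop could look like this (it assumes majority_x and majority_y still hold the full majority class, i.e. before the one-off subsample in your code, and n_iterations is just an illustrative name):

import numpy as np

n_iterations = 10  # illustrative number of repetitions

for i in range(n_iterations):
    # draw a fresh random subsample of the majority class each round
    inds = np.random.choice(majority_x.shape[0], minority_x.shape[0])
    X_train = np.concatenate((majority_x[inds], minority_x))
    y_train = np.concatenate((majority_y[inds], minority_y))
    # refit the pipeline on the balanced subset
    pipeline.fit(X_train, y_train)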
An idea for your related question: make your choice of the majority samples by creating a random permutation vector whose length equals the number of majority samples. Then choose the first 50 indices of that vector, then the next 50, and so on. When you have worked through the whole vector, each sample will have been chosen exactly once. If you want more iterations, or the remaining part of the permutation vector is too short, you can fall back to random choice.
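A rough sketch of that permutation idea, under the same assumptions as above (majority_x and majority_y hold the full majority class, and the block size of 50 matches your minority class):

import numpy as np

block = minority_x.shape[0]                    # 50 in your case
perm = np.random.permutation(majority_x.shape[0])

# walk through the permutation in blocks of 50; every majority sample
# is used exactly once before the permutation is exhausted
for start in range(0, len(perm) - block + 1, block):
    inds = perm[start:start + block]
    X_train = np.concatenate((majority_x[inds], minority_x))
    y_train = np.concatenate((majority_y[inds], minority_y))
    pipeline.fit(X_train, y_train)

# if you need more iterations (or the leftover tail is shorter than 50),
# fall back to np.random.choice as before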
As I mentioned in my comment, you might also want to add the parameter replace=False to your np.random.choice call if you want to prevent the same sample from appearing multiple times in one iteration.
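For example:

# draw 50 distinct majority indices; no sample is repeated within this iteration
inds = np.random.choice(majority_x.shape[0], 50, replace=False)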
Upvotes: 1