Reputation: 719
I have already pre-cleaned the data, and below is the format of the first 5 rows:
[IN] df.head()
[OUT] Year cleaned
0 1909 acquaint hous receiv follow letter clerk crown...
1 1909 ask secretari state war whether issu statement...
2 1909 i beg present petit sign upward motor car driv...
3 1909 i desir ask secretari state war second lieuten...
4 1909 ask secretari state war whether would introduc...
I have called train_test_split() as follows:
[IN] X_train, X_test, y_train, y_test = train_test_split(df['cleaned'], df['Year'], random_state=2)
[Note*] `X_train` and `y_train` are now pandas.core.series.Series of shape (1785,), and `X_test` and `y_test` are pandas.core.series.Series of shape (595,)
I have then vectorized the X training and testing data using the following TfidfVectorizer and fit/transform procedures:
[IN] v = TfidfVectorizer(decode_error='replace', encoding='utf-8', stop_words='english', ngram_range=(1, 1), sublinear_tf=True)
X_train = v.fit_transform(X_train)
X_test = v.transform(X_test)
I'm now at the stage where I would normally apply a classifier, etc. (if this were a balanced set of data). However, because the data is imbalanced, I initialize imblearn's SMOTE() class (to perform over-sampling)...
[IN] smote_pipeline = make_pipeline_imb(SMOTE(), classifier(random_state=42))
smote_model = smote_pipeline.fit(X_train, y_train)
smote_prediction = smote_model.predict(X_test)
... but this results in:
[OUT] ValueError: Expected n_neighbors <= n_samples, but n_samples = 5, n_neighbors = 6
I've attempted to reduce the value of n_neighbors, but to no avail. Any tips or advice would be much appreciated. Thanks for reading.
------------------------------------------------------------------------------------------------------------------------------------
EDIT:
The dataset/dataframe (df) contains 2380 rows across two columns, as shown in df.head() above. X_train contains 1785 of these rows in the format of a list of strings (df['cleaned']) and y_train also contains 1785 rows in the format of strings (df['Year']).
Post-vectorization using TfidfVectorizer(): X_train and X_test are converted from pandas.core.series.Series of shape (1785,) and (595,) respectively, to scipy.sparse.csr.csr_matrix of shape (1785, 126459) and (595, 126459) respectively.
As for the number of classes: using Counter(), I've calculated that there are 199 classes (years). Each instance of a class is attached to one element of the aforementioned df['cleaned'] data, which contains a list of strings extracted from a textual corpus.
The objective of this process is to automatically determine/guess the year, decade or century (any degree of classification will do!) of input textual data based on the vocabulary present.
Upvotes: 17
Views: 58142
Reputation: 591
I was able to solve this issue following number 1 of this answer.
from collections import Counter
from imblearn.over_sampling import SMOTE
Counter(y) # get the number of samples in each class
# drop the classes with only 1 sample, because that is fewer than k_neighbors, which has 2 as its minimum value in my case
X_res, y_res = SMOTE(k_neighbors=2).fit_resample(X, y)
Upvotes: 0
Reputation: 2655
WHY IT OCCURS:
In my case it occurred because some of the values/categories had as few as 1 sample. Since SMOTE is based on the KNN concept, it's not possible to apply SMOTE to classes with a single sample.
HOW I SOLVED IT:
Since those single-sample values/categories were equivalent to outliers, I removed them from the dataset and then applied SMOTE, and it worked.
Also try decreasing the k_neighbors parameter to make it work:
xr, yr = SMOTE(k_neighbors=3).fit_resample(x, y)
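A minimal sketch of the "remove the tiny classes first" step described above, using only the standard library. The helper name `rows_with_enough_samples` and the toy labels are made up for illustration; the threshold of k_neighbors + 1 is taken from the error in the question, where the default k_neighbors=5 led to n_neighbors = 6.

```python
from collections import Counter

def rows_with_enough_samples(labels, k_neighbors):
    """Return positions of rows whose class has at least k_neighbors + 1 samples."""
    counts = Counter(labels)
    min_needed = k_neighbors + 1  # SMOTE queries k_neighbors + 1 points per class
    return [i for i, label in enumerate(labels) if counts[label] >= min_needed]

# Toy labels standing in for y_train (years used as class labels).
y_toy = [1909, 1909, 1909, 1909, 1910, 1911, 1911, 1911, 1911]
keep = rows_with_enough_samples(y_toy, k_neighbors=3)
print(keep)  # the lone 1910 row (position 4) is dropped

# Assumed usage with the question's objects (not executed here):
#   X_kept, y_kept = X_train[keep], y_train.iloc[keep]
#   xr, yr = SMOTE(k_neighbors=3).fit_resample(X_kept, y_kept)
```

Returning positions rather than a filtered copy keeps the helper independent of whether the features are a sparse matrix or a pandas object.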
Upvotes: 4
Reputation: 1
I think it's possible to use the following code:
sampler = SMOTE(ratio={1: 1927, 0: 300},random_state=0)
Upvotes: 0
Reputation: 213
Try the code below for SMOTE:
oversampler=SMOTE(kind='regular',k_neighbors=2)
This worked for me.
Upvotes: 7
Reputation:
Since there are approximately 200 classes and 1800 samples in the training set, you have on average 9 samples per class. The reason for the error message is that a) the data are probably not perfectly balanced, so there are classes with fewer than 6 samples, and b) the number of neighbors is 6. A few solutions for your problem:
1. Calculate the minimum number of samples (n_samples) among the 199 classes and set the k_neighbors parameter of the SMOTE class so that n_neighbors (k_neighbors + 1, which is 6 with the default k_neighbors=5) is less than or equal to n_samples.
2. Exclude from oversampling the classes with n_samples < n_neighbors using the ratio parameter of the SMOTE class.
3. Use the RandomOverSampler class, which does not have a similar restriction.
4. Combine solutions 2 and 3: create a pipeline that uses SMOTE and RandomOverSampler in a way that satisfies the condition n_neighbors <= n_samples for the SMOTEd classes and uses random oversampling when the condition is not satisfied.
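One way the last solution in the list could be sketched is to build two per-class target dictionaries, one for each sampler. The helper `split_sampling_strategies` and the toy labels are hypothetical; the sketch also assumes a recent imbalanced-learn release, where the parameter this answer calls ratio is named sampling_strategy.

```python
from collections import Counter

def split_sampling_strategies(labels, k_neighbors=5):
    """Build per-class targets for RandomOverSampler and SMOTE.

    Hypothetical helper: classes with fewer than k_neighbors + 1 samples go
    to random oversampling, all other minority classes go to SMOTE.
    """
    counts = Counter(labels)
    target = max(counts.values())   # bring every class up to the largest one
    min_needed = k_neighbors + 1    # smallest class size SMOTE can handle
    random_strategy, smote_strategy = {}, {}
    for label, n in counts.items():
        if n == target:
            continue                # already at the target size
        if n < min_needed:
            random_strategy[label] = target
        else:
            smote_strategy[label] = target
    return random_strategy, smote_strategy

# Toy labels standing in for y_train.
y_toy = [1909] * 8 + [1910] * 6 + [1911] * 2
random_strategy, smote_strategy = split_sampling_strategies(y_toy, k_neighbors=5)
print(random_strategy)  # {1911: 8}
print(smote_strategy)   # {1910: 8}

# Assumed usage with imbalanced-learn (not executed here):
#   from imblearn.over_sampling import SMOTE, RandomOverSampler
#   from imblearn.pipeline import make_pipeline
#   pipeline = make_pipeline(
#       RandomOverSampler(sampling_strategy=random_strategy, random_state=0),
#       SMOTE(sampling_strategy=smote_strategy, k_neighbors=5, random_state=0),
#       classifier,  # any scikit-learn estimator
#   )
```

Running the random oversampler first means the smallest classes are already at the target size by the time SMOTE runs, and SMOTE is only asked to resample classes large enough for its neighbor search.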
Upvotes: 26