Reputation: 123
I'm trying to create a text classification model using scikit-learn. At first, I was using only the text's tf-idf matrix as a feature. The structure of my dataset can be seen below (the dataset is stored in a pandas dataframe called df):
>>> df.head(2)
id_1  id_2  id_3  target  text
  11   454   320     197  some text here
  15   440   111     205  text goes here too
>>> df.info()
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id_1    500 non-null    uint16
 1   id_2    500 non-null    uint16
 2   id_3    500 non-null    uint16
 3   target  500 non-null    uint16
 4   text    500 non-null    object
So, I split the train/test datasets and proceeded with creating the tfidf vector and transforming the data for training and testing.
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['target'], random_state=0)
vectorizer = TfidfVectorizer(max_features=500, decode_error="ignore", ngram_range=(1, 2))
vectorizer.fit(X_train)
X_train_tfidf = vectorizer.transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
So far the code appears to work. However, I need to improve the model by including another feature: I want to add the id_1 column to my features (it may carry important information for our ML model). So, in addition to my tf-idf matrix, I would like to add this column (id_1) as a new feature, so that I can pass both to the model for training.
What I have tried:
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['target'], random_state=0)
vectorizer = TfidfVectorizer(max_features=500, decode_error="ignore", ngram_range=(1, 2))
vectorizer.fit(X_train)
X_train_tfidf = vectorizer.transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
X_train_all_features = pd.concat([pd.DataFrame(X_train_tfidf.toarray()), df['id_1']], axis = 1)
So, the shape of my structure is
>>>print(X_train_tfidf.shape)
(37, 500) # as expected (I'm loading 50 lines, so this is about 75%)
>>>print(X_train_all_features.shape)
(50, 501) # the number of columns is as expected, but not the number of rows, because df['id_1'] was not split by train_test_split
In a nutshell, I want to pass two things to my ML algorithm: my tf-idf matrix and my id_1 feature.
I feel that I am missing something extremely basic, but even with all the research I have done, I have not been able to solve my problem satisfactorily. I'm honestly lost on this part of the problem and I don't know how to move on from here.
Upvotes: 0
Views: 2336
Reputation: 61
Your df has 50 rows and X_train_tfidf has 37, so pd.concat() returns a dataframe with 50 rows, with the tf-idf columns of the remaining 13 rows filled with NaN.
You added all values of your feature to the training tf-idf, which is not what you want.
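To make the row-count mismatch concrete, here is a minimal sketch with made-up toy data (the names features and id_col are hypothetical stand-ins for the tf-idf frame and df['id_1']). It shows that pd.concat with axis=1 aligns on index labels and keeps their union:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: 3 "training" rows of features, 5 rows in the full column.
features = pd.DataFrame(np.ones((3, 2)))                 # index 0, 1, 2
id_col = pd.Series([10, 20, 30, 40, 50], name="id_1")    # index 0, 1, 2, 3, 4

# axis=1 concatenation aligns rows by index label and keeps the union of labels.
combined = pd.concat([features, id_col], axis=1)

print(combined.shape)                    # rows come from the longer input
print(int(combined[0].isna().sum()))     # rows without matching features are NaN
```

The same thing happens in the question: the full df['id_1'] contributes all 50 index labels, so the result has 50 rows regardless of how many rows the tf-idf matrix has.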
To avoid messing up the train/test split when adding a new column, I'd recommend doing the split on the index of the original dataframe:
idx_train, idx_test = train_test_split(df.index, random_state=0)
X_train, y_train = df.loc[idx_train, 'text'], df.loc[idx_train, 'target']
# same for test
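Then you can add your "id_1" feature. Give the tf-idf dataframe the same index as the training rows, otherwise pd.concat will align the two pieces on mismatched labels:
X_train_all_features = pd.concat([pd.DataFrame(X_train_tfidf.toarray(), index=idx_train), df.loc[idx_train, 'id_1']], axis=1)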
UPDATE: I don't see a reason to convert the sparse matrix to a pandas dataframe. It will be very slow with a big enough dataset. Instead, add your feature to the matrix, so you can use it later in the downstream algorithm.
from scipy.sparse import hstack
X_train_tfidf = hstack([X_train_tfidf, df.loc[idx_train, 'id_1'].values.reshape(-1, 1)])  # column is named 'id_1'
Check dimensions
X_train_tfidf.shape # should be (37, 501)
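Putting the index split and the sparse hstack together, a self-contained sketch could look like the following. The toy dataframe and the helper add_id are assumptions made for illustration (only the column names come from the question), and the index is converted to a NumPy array before splitting purely to keep the example simple:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data using the question's column names.
df = pd.DataFrame({
    "id_1": np.array([11, 15, 11, 20, 15, 20, 11, 15], dtype="uint16"),
    "target": [0, 1, 0, 1, 0, 1, 0, 1],
    "text": ["some text here", "text goes here too", "more text here",
             "another short text", "text about cats", "text about dogs",
             "cats and dogs", "here is some text"],
})

# Split the row labels once, then use them to select every column consistently.
idx_train, idx_test = train_test_split(df.index.to_numpy(), random_state=0)

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train_tfidf = vectorizer.fit_transform(df.loc[idx_train, "text"])
X_test_tfidf = vectorizer.transform(df.loc[idx_test, "text"])

def add_id(tfidf, idx):
    """Append id_1 for the rows in idx as one extra sparse column (hypothetical helper)."""
    id_col = csr_matrix(df.loc[idx, ["id_1"]].to_numpy(dtype=float))
    return hstack([tfidf, id_col]).tocsr()

X_train_all = add_id(X_train_tfidf, idx_train)
X_test_all = add_id(X_test_tfidf, idx_test)
y_train, y_test = df.loc[idx_train, "target"], df.loc[idx_test, "target"]

print(X_train_all.shape, X_test_all.shape)
```

Note that this passes id_1 to the model as a plain number. Whether that is appropriate, or whether the identifier should be encoded differently, depends on what id_1 represents.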
Upvotes: 3
Reputation: 3030
Ideally, you want to first add the new column and then do the splitting. If for some reason this is not suitable, I suggest the following:
You need the indices of the observations in X_train_tfidf in order to get the corresponding values from df['id_1'], so you can't simply concatenate the entire df['id_1'] column to X_train_tfidf. Try replacing
X_train_all_features = pd.concat([pd.DataFrame(X_train_tfidf.toarray()), df['id_1']], axis = 1)
by the following code:
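# X_train_tfidf is a sparse matrix without an index, so build a dataframe that reuses X_train's index
X_train_all_features = pd.DataFrame(X_train_tfidf.toarray(), index=X_train.index)
X_train_all_features['id_1'] = df.loc[X_train.index, 'id_1']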
Let me know if this works.
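The first suggestion in this answer (add the column, then split) can be sketched with scikit-learn's ColumnTransformer, which applies the vectorizer to the text column and passes id_1 through unchanged. This is one possible way to do it, not necessarily what the answer had in mind; the toy data and the variable name features are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data using the question's column names.
df = pd.DataFrame({
    "id_1": np.array([11, 15, 11, 20, 15, 20, 11, 15], dtype="uint16"),
    "target": [0, 1, 0, 1, 0, 1, 0, 1],
    "text": ["some text here", "text goes here too", "more text here",
             "another short text", "text about cats", "text about dogs",
             "cats and dogs", "here is some text"],
})

# Split both feature columns together so their rows can never get out of sync.
X_train, X_test, y_train, y_test = train_test_split(
    df[["text", "id_1"]], df["target"], random_state=0
)

features = ColumnTransformer([
    # A string selector hands the vectorizer a 1-D column of documents.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2)), "text"),
    # A list selector keeps id_1 as a 2-D single-column block.
    ("id", "passthrough", ["id_1"]),
])

X_train_all = features.fit_transform(X_train)   # fit on training rows only
X_test_all = features.transform(X_test)

print(X_train_all.shape, X_test_all.shape)
```

Because the vectorizer is fitted inside the transformer on the training rows only, the test rows are transformed with the same vocabulary and end up with the same number of columns.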
Upvotes: 1