Combine additional data to my TFIDF array

Question

I'm trying to create a text classification model using scikit-learn. At first, I was using only the text's tfidf array as a feature. The structure of my dataset can be seen below (the dataset is stored in a pandas dataframe called df):

>>>df.head(2)

       id_1    id_2    id_3    target    text
       11      454     320     197       some text here
       15      440     111     205       text goes here too

>>>df.info()

    Data columns (total 5 columns):
     #   Column    Non-Null Count   Dtype 
    ---  ------    --------------   ----- 
     0   id_1      500 non-null     uint16
     1   id_2      500 non-null     uint16
     2   id_3      500 non-null     uint16
     3   target    500 non-null     uint16
     4   text      500 non-null     object

So, I split the train/test datasets and proceeded with creating the tfidf vector and transforming the data for training and testing.

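from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['text'], df['target'], random_state=0)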

vectorizer = TfidfVectorizer(max_features=500, decode_error="ignore", ngram_range=(1, 2))
vectorizer.fit(X_train)

X_train_tfidf = vectorizer.transform(X_train)
X_test_tfidf  = vectorizer.transform(X_test)

So far the code appears to work. However, I need to improve the model by including another feature. I want to add the id_1 column to my features, since it may carry important information for the model. So, in addition to my tfidf matrix, I would like to add the id_1 column as a new feature and pass both to the model for training.

What I have tried:

X_train, X_test, y_train, y_test = train_test_split(df['text'], df['target'], random_state=0)

vectorizer = TfidfVectorizer(max_features=500, decode_error="ignore", ngram_range=(1, 2))
vectorizer.fit(X_train)

X_train_tfidf = vectorizer.transform(X_train)
X_test_tfidf  = vectorizer.transform(X_test)

X_train_all_features = pd.concat([pd.DataFrame(X_train_tfidf.toarray()), df['id_1']], axis = 1)

So, the shape of my structure is

>>>print(X_train_tfidf.shape)

(37, 500) # as expected (I'm loading 50 lines, so this is about 75%)

>>>print(X_train_all_features.shape)

(50, 501) # the number of columns is as expected, but not the number of rows, because df['id_1'] was not split by train_test_split

In a nutshell, I want to pass my tfidf vector together with my id_1 feature to my ML algorithm.

I feel that I am missing something extremely basic, but even with all the research I have not been able to solve my problem satisfactorily. I'm honestly lost on this part of the problem and I don't know how to proceed from here.

antonms · Accepted Answer

Your df has 50 rows and X_train_tfidf has 37. Because pd.concat(axis=1) aligns rows on the index, it returns a dataframe with 50 rows, with the tf-idf columns of the remaining 13 filled with NaN.

You added all values of your feature to the training tf-idf, which is not what you want.
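To make the alignment behaviour concrete, here is a minimal sketch on made-up data. The 5-row frame, the zero-filled `tfidf_block`, and the `train_rows` labels are assumptions standing in for the real df, the dense tf-idf array, and the shuffled training index.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: 5 rows in the full frame, 3 of them picked for training.
df = pd.DataFrame({"id_1": [11, 15, 19, 23, 27]})
train_rows = [4, 0, 2]  # shuffled row labels, similar to what a random split gives

# A dense "tf-idf" block for the 3 training rows gets a fresh 0..2 index.
tfidf_block = pd.DataFrame(np.zeros((3, 2)))

# concat(axis=1) takes the union of the row labels, so all 5 rows come back
# and the rows without tf-idf values are padded with NaN.
misaligned = pd.concat([tfidf_block, df["id_1"]], axis=1)
print(misaligned.shape)

# Giving the block the training labels and selecting the same labels from df
# makes the rows line up one to one.
tfidf_block.index = train_rows
aligned = pd.concat([tfidf_block, df.loc[train_rows, "id_1"]], axis=1)
print(aligned.shape)
```

The first result keeps every row of df, while the second has exactly one row per training sample.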

To avoid messing up the train/test split when adding a new column, I'd recommend doing the split on the index of the original dataframe:

idx_train, idx_test = train_test_split(df.index, random_state=0)
X_train, y_train = df.loc[idx_train, 'text'],  df.loc[idx_train, 'target']
X_test, y_test = df.loc[idx_test, 'text'], df.loc[idx_test, 'target']

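Then you can add your "id_1" feature:

# give the dense tf-idf frame the training index so concat lines the rows up with df.loc[idx_train, 'id_1']
X_train_all_features = pd.concat([pd.DataFrame(X_train_tfidf.toarray(), index=idx_train), df.loc[idx_train, 'id_1']], axis=1)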

UPDATE: I don't see a reason to convert the sparse matrix to a pandas dataframe. It will be very slow on a big enough dataset. Instead, add your feature to the sparse matrix directly, so you can use it later in a downstream algorithm.

from scipy.sparse import hstack
# append id_1 for the training rows as one extra column
X_train_tfidf = hstack([X_train_tfidf, df.loc[idx_train, 'id_1'].values.reshape(-1, 1)])

Check dimensions

X_train_tfidf.shape # should be (37, 501)
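Putting the pieces together, the sketch below runs the sparse approach end to end on toy data, including the test rows, which need the same extra column. The 8-row frame, the `add_id_column` helper, and the choice of LogisticRegression are assumptions for illustration only; none of them come from the question.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data (made up), using the same column names as the question.
df = pd.DataFrame({
    "id_1":   [11, 15, 11, 15, 11, 15, 11, 15],
    "target": [0, 1, 0, 1, 0, 1, 0, 1],
    "text":   ["some text here", "text goes here too",
               "more text here", "other text goes there",
               "some words here", "words go here too",
               "text again here", "text goes again"],
})

# Split the row labels once, then use them for every column.
idx_train, idx_test = train_test_split(df.index, random_state=0)

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train_tfidf = vectorizer.fit_transform(df.loc[idx_train, "text"])
X_test_tfidf = vectorizer.transform(df.loc[idx_test, "text"])


def add_id_column(tfidf, row_labels):
    """Hypothetical helper: append id_1 for the given rows as one sparse column."""
    id_col = csr_matrix(df.loc[row_labels, ["id_1"]].to_numpy(dtype=float))
    return hstack([tfidf, id_col]).tocsr()


X_train_all = add_id_column(X_train_tfidf, idx_train)
X_test_all = add_id_column(X_test_tfidf, idx_test)
print(X_train_all.shape, X_test_all.shape)

# Any estimator that accepts sparse input could be used here.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_all, df.loc[idx_train, "target"])
predictions = clf.predict(X_test_all)
```

Because the same row labels select both the text and id_1, the extra column stays in step with the tf-idf rows in both matrices, and train and test end up with the same number of columns.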
