ItsMeMario

Reputation: 123

Combine additional data to my TFIDF array

I'm trying to create a text classification model using scikit-learn. At first, I was using only the text's tfidf array as a feature. The structure of my dataset can be seen below (the dataset is stored in a pandas dataframe called df):

>>>df.head(2)

       id_1    id_2    id_3    target    text
       11      454     320     197       some text here
       15      440     111     205       text goes here too

>>>df.info()

    Data columns (total 5 columns):
     #   Column    Non-Null Count   Dtype 
    ---  ------    --------------   ----- 
     0   id_1      500 non-null     uint16
     1   id_2      500 non-null     uint16
     2   id_3      500 non-null     uint16
     3   target    500 non-null     uint16
     4   text      500 non-null     object

So, I split the train/test datasets and proceeded with creating the tfidf vector and transforming the data for training and testing.

X_train, X_test, y_train, y_test = train_test_split(df['text'], df['target'], random_state=0)

vectorizer = TfidfVectorizer(max_features=500, decode_error="ignore", ngram_range=(1, 2))
vectorizer.fit(X_train)

X_train_tfidf = vectorizer.transform(X_train)
X_test_tfidf  = vectorizer.transform(X_test)

So far the code appears to work. However, I need to improve the model by including another feature: I want to add the id_1 column to my features, since it may carry important information for the ML model. So, in addition to my tfidf matrix, I would like to append this column (id_1) as a new feature, so that I can pass both to the model for training.

What I have tried:

X_train, X_test, y_train, y_test = train_test_split(df['text'], df['target'], random_state=0)

vectorizer = TfidfVectorizer(max_features=500, decode_error="ignore", ngram_range=(1, 2))
vectorizer.fit(X_train)

X_train_tfidf = vectorizer.transform(X_train)
X_test_tfidf  = vectorizer.transform(X_test)

X_train_all_features = pd.concat([pd.DataFrame(X_train_tfidf.toarray()), df['id_1']], axis = 1)

So, the shape of my structure is

>>>print(X_train_tfidf.shape)

(37, 500) # as expected (I'm loading 50 lines, so this is about 75%)

>>>print(X_train_all_features.shape)

(50, 501) # the number of columns is as expected, but not the number of rows, because df['id_1'] was not split by the train_test_split function

In a nutshell, I want to pass to my ML algorithm something like the image below, i.e. my tfidf vector together with my id_1 feature:

[Image: tfidf matrix concatenated with the id_1 column]

I feel that I am missing something extremely basic, but even with all my research I have not been able to solve the problem satisfactorily. I'm honestly lost on this part and I don't know how to proceed from here.

Upvotes: 0

Views: 2336

Answers (2)

antonms

Reputation: 61

Your df has 50 rows and X_train_tfidf has 37. pd.concat() with axis=1 aligns on the index, so it returns a dataframe with 50 rows, with the tf-idf columns of the remaining 13 rows filled with NaN.

You added all values of your feature to training tf-idf, which is not what you want.
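To make the alignment behaviour concrete, here is a minimal sketch using only pandas and NumPy. The tiny frame, the `train_rows` labels, and the zero matrix standing in for `X_train_tfidf.toarray()` are all made up for illustration; they are not from the question.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (assumed): 4 rows in the full frame, 3 of them in the training split.
df = pd.DataFrame({"id_1": [11, 15, 12, 19]})
train_rows = [2, 0, 3]                          # original row labels picked for training
tfidf_dense = np.zeros((len(train_rows), 2))    # placeholder for X_train_tfidf.toarray()

# A DataFrame built from a bare array gets a fresh 0..n-1 index, so concat(axis=1)
# aligns on labels and keeps the union of both indexes (all 4 rows).
misaligned = pd.concat([pd.DataFrame(tfidf_dense), df["id_1"]], axis=1)
print(misaligned.shape)

# Giving both pieces the same row labels keeps only the training rows,
# each paired with its own id_1 value.
aligned = pd.concat(
    [pd.DataFrame(tfidf_dense, index=train_rows), df.loc[train_rows, "id_1"]],
    axis=1,
)
print(aligned.shape)
```

The first result has one row per row of `df`, with NaN in the tf-idf columns wherever no training row carries that label; the second has exactly the training rows.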

To avoid messing up the train/test split when adding a new column, I'd recommend doing the split on the index of the original dataframe:

idx_train, idx_test = train_test_split(df.index, random_state=0)
X_train, y_train = df.loc[idx_train, 'text'],  df.loc[idx_train, 'target']
# same for test

Then you can add your "id_1" feature:

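# give the tf-idf frame the same row labels as df, otherwise concat misaligns the rows
X_train_all_features = pd.concat([pd.DataFrame(X_train_tfidf.toarray(), index=idx_train), df.loc[idx_train, 'id_1']], axis=1)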

UPDATE: I don't see a reason to convert the sparse matrix to a pandas dataframe. It will be very slow with a big enough dataset. Instead, add your feature to the sparse matrix, so you can use it later in the downstream algorithm.

from scipy.sparse import hstack 
X_train_tfidf = hstack([X_train_tfidf, df.loc[idx_train, 'id_1'].values.reshape(-1, 1)])

Check dimensions

X_train_tfidf.shape # should be (37, 501)
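Putting the pieces of this answer together, an end-to-end sketch might look like the following. The toy dataframe and the `add_id` helper are assumptions made up for illustration; the vectorizer settings follow the question, and the split works on row labels as suggested above.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy data (assumed); the real df comes from the question.
df = pd.DataFrame({
    "id_1": [11, 15, 12, 19, 14, 17, 13, 16],
    "target": [0, 1, 0, 1, 0, 1, 0, 1],
    "text": ["some text here", "text goes here too", "more text", "another text row",
             "text again", "yet more text", "final text", "last text here"],
})

# Split the row labels, then use them to pick every column consistently.
idx_train, idx_test = train_test_split(df.index.to_numpy(), random_state=0)

vectorizer = TfidfVectorizer(max_features=500, ngram_range=(1, 2))
X_train_tfidf = vectorizer.fit_transform(df.loc[idx_train, "text"])
X_test_tfidf = vectorizer.transform(df.loc[idx_test, "text"])

def add_id(tfidf, idx):
    """Hypothetical helper: append id_1 as one extra sparse column."""
    id_col = csr_matrix(df.loc[idx, "id_1"].to_numpy().reshape(-1, 1))
    return hstack([tfidf, id_col]).tocsr()

X_train_all = add_id(X_train_tfidf, idx_train)
X_test_all = add_id(X_test_tfidf, idx_test)
y_train, y_test = df.loc[idx_train, "target"], df.loc[idx_test, "target"]

print(X_train_all.shape, X_test_all.shape)
```

Note that id_1 is on a very different scale from the tf-idf values, so depending on the classifier it may be worth scaling or encoding it first; that choice is outside what this sketch covers.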

Upvotes: 3

Michael Hodel

Reputation: 3030

Ideally, you want to first add the new column and then do the splitting. If for some reason this is not suitable, I suggest the following:

You need the indices of the observations in X_train_tfidf in order to get the corresponding values from df['id_1'], so you can't simply concatenate the entire df['id_1'] column to X_train_tfidf. Try replacing

X_train_all_features = pd.concat([pd.DataFrame(X_train_tfidf.toarray()), df['id_1']], axis = 1)

by the following code:

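# X_train_tfidf is a sparse matrix without an index, so take the row labels from X_train
X_train_all_features = pd.DataFrame(X_train_tfidf.toarray(), index=X_train.index)
X_train_all_features['id_1'] = df.loc[X_train.index, 'id_1']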

Let me know if this works.
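For the first suggestion in this answer (add the column, then split), one possible sketch is to split a two-column frame and let scikit-learn's `ColumnTransformer` build the combined matrix. The toy dataframe and the step names "tfidf" and "id" are assumptions for illustration, not something from the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy data (assumed); the real df comes from the question.
df = pd.DataFrame({
    "id_1": [11, 15, 12, 19, 14, 17, 13, 16],
    "target": [0, 1, 0, 1, 0, 1, 0, 1],
    "text": ["some text here", "text goes here too", "more text", "another text row",
             "text again", "yet more text", "final text", "last text here"],
})

# Split text and id_1 together so their rows can never get out of step.
X_train, X_test, y_train, y_test = train_test_split(
    df[["text", "id_1"]], df["target"], random_state=0
)

features = ColumnTransformer([
    # A plain string selects a 1-D column, which is what TfidfVectorizer expects.
    ("tfidf", TfidfVectorizer(max_features=500, ngram_range=(1, 2)), "text"),
    # A list selects a 2-D block that is passed through unchanged.
    ("id", "passthrough", ["id_1"]),
])

X_train_all = features.fit_transform(X_train)   # fit on training rows only
X_test_all = features.transform(X_test)

print(X_train_all.shape, X_test_all.shape)
```

The result may be a sparse matrix or a dense array depending on how sparse the stacked data is; most scikit-learn estimators accept either.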

Upvotes: 1
