Reputation: 59
I am running TfidfVectorizer on large data (ideally, I want to run it on all of my data, which is 30,000 texts with around 20,000 words each). Initially, I was using the default sklearn.feature_extraction.text.TfidfVectorizer, but I decided to run it on a GPU so that it would be faster. The result is quite the opposite: it is really, really slow! I am running the code on a Kaggle notebook with a Tesla P100-PCIE-16GB (a very strong GPU).
You can compare the two pieces of code here:
Non-GPU implementation:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv('data.csv')
X = df.input_text.astype(str).to_numpy()

print('Transforming...')
print(len(X))

# Build the vocabulary and compute the TF-IDF matrix on the CPU
model = TfidfVectorizer(lowercase=True, max_features=1000)
model.fit_transform(X)
GPU implementation:
import pandas as pd
import cudf
from cuml.feature_extraction.text import TfidfVectorizer

df = pd.read_csv('data.csv')
X = df.input_text.astype(str).to_numpy()

# Copy the data from host (CPU) memory to device (GPU) memory
X = cudf.Series(X)
print(X.shape)

print('Transforming...')
# Build the vocabulary and compute the TF-IDF matrix on the GPU
model = TfidfVectorizer(lowercase=True, max_features=1000)
model.fit_transform(X)
If you run these two pieces of code, you will notice that the non-GPU implementation is A LOT faster than the GPU implementation. You can also test this yourself on Kaggle, since they provide very strong GPUs. My question is: why is this the case, and how can I make use of the GPU to speed up the process?
Upvotes: 1
Views: 1413
Reputation: 11420
There are quite a few possible reasons why the two implementations differ in execution speed. The two most likely scenarios are the following.
The first is that the GPU implementation in cuML differs from the one in scikit-learn and is simply less efficient. This can have a variety of causes: it could be a more "high-level" computation (compared to a very native implementation in scikit-learn), which would slow down parts of the operation; it could also be a transformation that cannot be executed efficiently on a GPU (I am not super familiar with GPU computing, but I would assume that the varying lengths of the texts don't play nicely here).
The second is the overhead of shuffling data between CPU and GPU memory. I have given a related answer here. In that context, we observed that the GPU computation itself was fairly fast, but copying the data between memories caused so much overhead that the input needed to be reasonably large before any performance increase could be observed.
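To see whether the transfer dominates in your case, you could time the host-to-device copy separately from the computation itself. A minimal sketch, reusing the file and column names from your snippets:

import time

import pandas as pd
import cudf
from cuml.feature_extraction.text import TfidfVectorizer

df = pd.read_csv('data.csv')
X_host = df.input_text.astype(str).to_numpy()

# Time the CPU -> GPU copy separately from the TF-IDF computation
t0 = time.perf_counter()
X_device = cudf.Series(X_host)
t1 = time.perf_counter()

model = TfidfVectorizer(lowercase=True, max_features=1000)
model.fit_transform(X_device)
t2 = time.perf_counter()

print(f'CPU -> GPU copy: {t1 - t0:.2f}s')
print(f'fit_transform:   {t2 - t1:.2f}s')

If the copy accounts for a significant fraction of the total time, the transfer overhead is at least part of the problem.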
My suggestion would be to inspect a profiling run of both the scikit-learn and the cuML implementation, and see which functions your code spends the most time in. That way, you can probably tell whether the slowdown is due to a specific function call in cuML or to general GPU inefficiency.
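For example, with Python's built-in cProfile (a minimal sketch; the setup mirrors the GPU snippet from the question):

import cProfile
import pstats

import pandas as pd
import cudf
from cuml.feature_extraction.text import TfidfVectorizer

df = pd.read_csv('data.csv')
X = cudf.Series(df.input_text.astype(str).to_numpy())
model = TfidfVectorizer(lowercase=True, max_features=1000)

# Profile only the fit_transform call and report the 20 most expensive functions
profiler = cProfile.Profile()
profiler.enable()
model.fit_transform(X)
profiler.disable()
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)

Running the same wrapper around the scikit-learn version lets you compare where each implementation spends its time.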
Upvotes: 2