Reputation: 59
I am running TfidfVectorizer on large data (ideally, I want to run it on all of my data, which is 30,000 texts with around 20,000 words each). Initially, I was using the default sklearn.feature_extraction.text.TfidfVectorizer, but I decided to run it on a GPU so that it would be faster. The result is quite the opposite: it is really, really slow! I am running the code on a Kaggle notebook with a Tesla P100-PCIE-16GB (a very strong GPU).
You can compare the two pieces of code here:
Non-GPU implementation:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv('data.csv')
X = df.input_text.astype(str).to_numpy()

print('Transforming...')
print(len(X))

# Build the vocabulary and compute the TF-IDF matrix on the CPU
model = TfidfVectorizer(lowercase=True, max_features=1000)
model.fit_transform(X)
GPU implementation:
import pandas as pd
import cudf
from cuml.feature_extraction.text import TfidfVectorizer

df = pd.read_csv('data.csv')
X = df.input_text.astype(str).to_numpy()

# Copy the data from host (CPU) memory to device (GPU) memory
X = cudf.Series(X)
print(X.shape)

print('Transforming...')
# Build the vocabulary and compute the TF-IDF matrix on the GPU
model = TfidfVectorizer(lowercase=True, max_features=1000)
model.fit_transform(X)
If you run these two pieces of code, you will notice that the non-GPU implementation is A LOT faster than the GPU implementation. You can also test this yourself on Kaggle, since they provide very strong GPUs. My question is: why is this the case, and how can I make use of the GPU to speed up the process?
Upvotes: 1
Views: 1413
Reputation: 11420
There are quite a few possible reasons why the two implementations differ in execution speed. The two most likely scenarios are the following.
The first is that the GPU implementation in cuML differs from the one in scikit-learn and is simply less efficient. This can have a variety of causes: it could be a more "high-level" computation (compared to a very native implementation in scikit-learn), which would slow down parts of the operation; it could also be a transformation that cannot be executed efficiently on a GPU (I am not super familiar with GPU computing, but I would assume that the varying lengths of the texts don't play nicely here).
The second is the overhead of shuffling data between CPU and GPU memory. I have given a related answer here. In that context, we observed that the GPU computation itself was fairly fast, but copying the data between memories caused so much overhead that the input needed to be reasonably large before any performance increase could be observed.
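To see whether the transfer dominates in your case, you could time the host-to-device copy separately from the computation itself. A minimal sketch, reusing the file and column names from your snippets:

import time

import pandas as pd
import cudf
from cuml.feature_extraction.text import TfidfVectorizer

df = pd.read_csv('data.csv')
X_host = df.input_text.astype(str).to_numpy()

# Time the CPU -> GPU copy separately from the TF-IDF computation
t0 = time.perf_counter()
X_device = cudf.Series(X_host)
t1 = time.perf_counter()

model = TfidfVectorizer(lowercase=True, max_features=1000)
model.fit_transform(X_device)
t2 = time.perf_counter()

print(f'CPU -> GPU copy: {t1 - t0:.2f}s')
print(f'fit_transform:   {t2 - t1:.2f}s')

If the copy accounts for a significant fraction of the total time, the transfer overhead is at least part of the problem.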
My suggestion would be to inspect a profiling run of both the scikit-learn and the cuML implementation, and see which functions your code spends the most time in. That way, you can probably tell whether the slowdown is due to a specific function call in cuML or to general GPU inefficiency.
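For example, with Python's built-in cProfile (a minimal sketch; the setup mirrors the GPU snippet from the question):

import cProfile
import pstats

import pandas as pd
import cudf
from cuml.feature_extraction.text import TfidfVectorizer

df = pd.read_csv('data.csv')
X = cudf.Series(df.input_text.astype(str).to_numpy())
model = TfidfVectorizer(lowercase=True, max_features=1000)

# Profile only the fit_transform call and report the 20 most expensive functions
profiler = cProfile.Profile()
profiler.enable()
model.fit_transform(X)
profiler.disable()
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)

Running the same wrapper around the scikit-learn version lets you compare where each implementation spends its time.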
Upvotes: 2