Reputation: 866
I am vectorizing a text blob whose tokens have the following style:
hi__(how are you), 908__(number code), the__(POS)
As you can see, each token has some information attached with __(info). I am extracting keywords using TF-IDF, as follows:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(doc)
indices = np.argsort(vectorizer.idf_)[::-1]  # features sorted by descending idf
features = vectorizer.get_feature_names()
The problem is that when I run the above procedure to extract keywords, I suspect the vectorizer is stripping the parentheses from my text blob. Which parameter of TfidfVectorizer can I use to preserve the information inside the parentheses?
UPDATE
I also tried:
from sklearn.feature_extraction.text import TfidfVectorizer

def dummy_fun(doc):
    return doc

tfidf = TfidfVectorizer(
    analyzer='word',
    tokenizer=dummy_fun,
    preprocessor=dummy_fun,
    token_pattern=None)
and
from sklearn.feature_extraction.text import TfidfVectorizer

def dummy_fun(doc):
    return doc

tfidf = TfidfVectorizer(
    tokenizer=dummy_fun,
    preprocessor=dummy_fun,
    token_pattern=None)
However, this returns a sequence of characters instead of the tokens that I have already tokenized:
['e', 's', '_', 'a', 't', 'o', 'c', 'r', 'i', 'n']
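I am guessing this happens because each document I pass is a plain string: with an identity tokenizer the vectorizer seems to iterate over the string character by character. A minimal sketch of what I think this route would need (documents pre-split into token lists; the sample data is just illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

def dummy_fun(doc):
    # identity function: each document is expected to already be a list of tokens
    return doc

# each "document" is a list of pre-extracted tokens (illustrative data)
docs = [['hi__(how are you)', '908__(number code)', 'the__(POS)']]

tfidf = TfidfVectorizer(
    analyzer='word',
    tokenizer=dummy_fun,
    preprocessor=dummy_fun,
    token_pattern=None)

X = tfidf.fit_transform(docs)
print(tfidf.get_feature_names())  # full tokens, parentheses included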
Upvotes: 1
Views: 2328
Reputation: 3113
The problem is that the default tokenization used by TfidfVectorizer
explicitly ignores all punctuation:
token_pattern : string
Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
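You can see the effect of that default pattern directly with the standard re module (a quick illustration using one of the sample strings from the question):

import re

# the default token_pattern used when analyzer == 'word'
default_pattern = r'(?u)\b\w\w+\b'

print(re.findall(default_pattern, 'hi__(how are you), 908__(number code)'))
# ['hi__', 'how', 'are', 'you', '908__', 'number', 'code']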
Your problem is related to this previous question, but instead of treating punctuation as separate tokens, you want to prevent token__(info) from being split apart. In both cases the solution is to write a custom token_pattern, although the exact patterns are different.
Assuming every token already has __(info) attached:
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\w\w+__\([\w\s]*\)')
X = vectorizer.fit_transform(doc)
I simply modified the default token_pattern so it now matches any 2 or more alphanumeric characters followed by __(, then 0 or more alphanumeric or whitespace characters, and ending with a ). If you want more information on how to write your own token_pattern, see the Python docs for regular expressions.
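As a quick sanity check (illustrative only, using the sample tokens from the question as a single document):

from sklearn.feature_extraction.text import TfidfVectorizer

doc = ['hi__(how are you) 908__(number code) the__(POS)']

vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\w\w+__\([\w\s]*\)')
X = vectorizer.fit_transform(doc)

print(vectorizer.get_feature_names())
# ['908__(number code)', 'hi__(how are you)', 'the__(pos)']  (note the default lowercasing)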
Upvotes: 2