Reputation: 8297
I am reading some text data from a CSV and trying to build a TF-IDF feature vector from it.
The data looks something like this, where the content column contains specially formatted strings (synsets).
When I build a TF-IDF vector from it, I expect that format to be preserved, but when I do
tfidf = TfidfVectorizer()
data['content'] = data['content'].fillna('')
tfidf_matrix = tfidf.fit_transform(data['content'])
and look at tfidf.vocabulary_, I see that the text has been split up:
{'square': 3754,
'01': 0,
'02': 1,
'public_square': 3137,
'04': 3,
'05': 4,
'06': 5,
'07': 6,
'08': 7,
'03': 2,
'feather': 1666,
'straight': 3821,...
I want it to count square.n.01 as a single token instead of splitting it up.
I could do this if I built TF-IDF from scratch, but that feels unnecessary. Any help?
Upvotes: 1
Views: 664
Reputation: 2868
You need to write your own tokenization function and pass it to the tokenizer parameter of TfidfVectorizer:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame(data=[[['square.n.01', 'square.n.02', 'public_square.n.01']],
                        [['two.n.01', 'deuce.n.04', 'two.s.01']]],
                  columns=['content'])

# Convert each list to a string and strip the brackets, leaving a comma-separated string.
df['content'] = df['content'].astype(str)
df['content'] = df['content'].apply(lambda x: x.replace('[', '').replace(']', ''))

# Custom tokenizer: split only on commas so synsets like square.n.01 stay intact.
def my_tokenizer(doc):
    return doc.split(',')

tfidf = TfidfVectorizer(tokenizer=my_tokenizer)
tfidf_matrix = tfidf.fit_transform(df['content'])
print(tfidf.vocabulary_)
# output:
{"'square.n.01'": 4,
" 'square.n.02'": 2,
" 'public_square.n.01'": 1,
"'two.n.01'": 5,
" 'deuce.n.04'": 0,
" 'two.s.01'": 3}
Upvotes: 2
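Note that the vocabulary keys above still carry the quotes and leading spaces left over from the str() conversion of the lists. If you want clean keys like square.n.01, a small variation on the same idea (just a sketch, with an illustrative clean_tokenizer helper) is to join the lists yourself and strip whitespace inside the tokenizer:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame(data=[[['square.n.01', 'square.n.02', 'public_square.n.01']],
                        [['two.n.01', 'deuce.n.04', 'two.s.01']]],
                  columns=['content'])

# Join each synset list into a comma-separated string, so no quotes or brackets
# end up in the text in the first place.
df['content'] = df['content'].apply(lambda synsets: ','.join(synsets))

def clean_tokenizer(doc):
    # Split on commas and strip surrounding whitespace from each synset.
    return [token.strip() for token in doc.split(',')]

tfidf = TfidfVectorizer(tokenizer=clean_tokenizer)
tfidf_matrix = tfidf.fit_transform(df['content'])
print(tfidf.vocabulary_)
# e.g. {'deuce.n.04': 0, 'public_square.n.01': 1, 'square.n.01': 2, ...}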