Ganesh Sundar

Reputation: 311

sklearn - how to use TfidfVectorizer to use entire strings?

I have this problem where I am using the hostnames of all the URLs I have in my dataset as features. I'm not able to figure out how to use TfidfVectorizer to extract hostnames only from the URLs and calculate their weights. For instance, I have a dataframe df where the column 'url' has all the URLs I need. I thought I had to do something like:

from urllib.parse import urlparse

from sklearn.feature_extraction.text import TfidfVectorizer

def preprocess(t):
    # Reduce each URL to its hostname before tokenisation
    return urlparse(t).hostname

tfv = TfidfVectorizer(preprocessor=preprocess)

tfv.fit_transform(df['url'])

It doesn't seem to work this way, since it splits the hostnames into pieces instead of treating them as whole strings. I think it's due to analyzer='word' (the default), which splits the string into words.
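
For example, checking the analyser on a sample URL with the setup above shows the split:

print(tfv.build_analyzer()('https://www.example.org/some/path'))
# ['www', 'example', 'org'] -- the hostname is broken into word tokens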

Any help would be appreciated, thanks!

Upvotes: 3

Views: 3176

Answers (1)

mbatchkarov

Reputation: 16039

You are right: analyzer='word' creates a tokeniser that uses the default token pattern r'(?u)\b\w\w+\b'. If you want to treat each entire URL as a single token, you can change the token pattern:

from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(token_pattern=r'\S+')

This tokenises https://www.pythex.org hello hello.there as ['https://www.pythex.org', 'hello', 'hello.there']. You can then create an analyser that extracts the hostname from URLs, as shown in this question. You can either extend CountVectorizer and override its build_analyzer method, or just monkey-patch it:

def my_analyser():
    # magic is a function that extracts the hostname from a URL, among other things
    return lambda doc: magic(preprocess(vect.decode(doc)))

vect = CountVectorizer(token_pattern=r'\S+')
vect.build_analyzer = my_analyser
vect.fit_transform(...)
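
For the subclassing route, here is a minimal sketch applied to the original question's use case, with hostname extraction standing in for magic (get_feature_names_out assumes scikit-learn 1.0+):

from urllib.parse import urlparse

from sklearn.feature_extraction.text import TfidfVectorizer

class HostnameVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # Each document (a URL) yields a single token: its hostname
        # (note: hostname is None for strings without a scheme)
        return lambda doc: [urlparse(self.decode(doc)).hostname]

vect = HostnameVectorizer()
X = vect.fit_transform(['https://www.pythex.org/abc', 'http://example.com'])
print(vect.get_feature_names_out())  # ['example.com' 'www.pythex.org']

Each URL then contributes exactly one feature, its hostname, and the tf-idf weights are computed over hostnames rather than word fragments.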

Note: tokenisation is not as simple as it appears. The regex I've used has many limitations, e.g. it won't split the last token of one sentence from the first token of the next if there is no space after the full stop. In general, regex tokenisers get very unwieldy very quickly. I recommend looking at nltk, which offers several non-regex tokenisers.
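
For example, nltk's word_tokenize handles punctuation and contractions that quickly bloat a regex (a minimal sketch; the tokeniser models need a one-off download):

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # one-off download of tokeniser models ('punkt_tab' in newer NLTK releases)

print(word_tokenize("Don't split on every dot. It's harder than it looks."))
# ['Do', "n't", 'split', 'on', 'every', 'dot', '.', 'It', "'s", 'harder', 'than', 'it', 'looks', '.']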

Upvotes: 5
