Reputation: 311
I have this problem where I am using the hostnames of all the URLs I have in my dataset as features. I'm not able to figure out how to use TfidfVectorizer to extract hostnames only from the URLs and calculate their weights. For instance, I have a dataframe df where the column 'url' has all the URLs I need. I thought I had to do something like:
from urllib.parse import urlparse
from sklearn.feature_extraction.text import TfidfVectorizer

def preprocess(t):
    return urlparse(t).hostname

tfv = TfidfVectorizer(preprocessor=preprocess)
tfv.fit_transform([t for t in df['url']])
It doesn't seem to work this way: it splits the hostnames into pieces instead of treating them as whole strings. I think it has to do with analyzer='word' (the default), which splits the string into words.
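For example, calling the analyser that the vectoriser builds shows the hostname being split (www.example.com is just a made-up URL):

print(tfv.build_analyzer()('https://www.example.com/some/path'))
# prints ['www', 'example', 'com']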
Any help would be appreciated, thanks!
Upvotes: 3
Views: 3176
Reputation: 16039
You are right: analyzer='word' (the default) creates a tokeniser that uses the default token pattern '(?u)\b\w\w+\b'. If you want to tokenise the entire URL as a single token, you can change the token pattern:

vect = CountVectorizer(token_pattern=r'\S+')
This tokenises 'https://www.pythex.org hello hello.there' as ['https://www.pythex.org', 'hello', 'hello.there'].
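You can check this by calling the analyser that the vectoriser builds:

print(vect.build_analyzer()('https://www.pythex.org hello hello.there'))
# ['https://www.pythex.org', 'hello', 'hello.there']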
You can then create an analyser to extract the hostname from URLs, as shown in this question. You can either extend CountVectorizer to change its build_analyzer method or just monkey patch it:
def my_analyser(self):
    preprocess = self.build_preprocessor()
    # magic is a function that extracts the hostname from URLs, among other things
    return lambda doc: magic(preprocess(self.decode(doc)))

vect = CountVectorizer(token_pattern=r'\S+')
vect.build_analyzer = my_analyser.__get__(vect)  # monkey patch the instance
vect.fit_transform(...)
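For completeness, here is a minimal sketch of the subclassing alternative, with magic still standing in for the hostname-extraction function (HostnameVectorizer is just an illustrative name):

class HostnameVectorizer(CountVectorizer):
    def build_analyzer(self):
        preprocess = self.build_preprocessor()
        # magic is the same placeholder: tokenise the text and extract hostnames from URL tokens
        return lambda doc: magic(preprocess(self.decode(doc)))

vect = HostnameVectorizer(token_pattern=r'\S+')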
Note: tokenisation is not as simple as it appears. The regex I've used has many limitations; for example, it doesn't separate the last token of a sentence from the first token of the next sentence if there isn't a space after the full stop. In general, regex tokenisers get very unwieldy very quickly. I recommend looking at nltk, which offers several different non-regex tokenisers.
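For example, a quick look at nltk's word tokeniser (it handles punctuation as separate tokens):

import nltk
nltk.download('punkt')  # fetch the tokeniser models; the resource name may differ across nltk versions
from nltk.tokenize import word_tokenize

print(word_tokenize('hello there. how are you?'))
# ['hello', 'there', '.', 'how', 'are', 'you', '?']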
Upvotes: 5