Chidananda Nayak

Reputation: 1201

TFIDF and Multilingual Text Classification

I have a scenario: a store has video content in several languages, including English. I want to build item-to-item recommendations using TF-IDF, but I am confused about stop words. How will it perform across such diverse languages? And what should I pass as stop_words?

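from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df["combined_features"])
cosine_sim = cosine_similarity(tfidf_matrix)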

Upvotes: 0

Views: 1939

Answers (2)

crosslingual

Reputation: 21

Try Text2Text to get the TF-IDF vectors. It supports hundreds of languages.

There is no need to worry about stop words or stemming, as described in this paper.

Upvotes: 1

Anna Maule

Reputation: 260

Stop words are a set of commonly used words that add more noise to the text than useful information. Common stop words in English include: a, the, in, an. Punctuation can also be treated as a stop word.

Some libraries, such as NLTK, already ship established sets of stop words for English. Example:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stop word corpus
set(stopwords.words('english'))

You can also customize your stop word list based on the context of the NLP application that you are building.
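As a rough sketch of what such customization might look like: the snippet below extends a small base list with domain-specific words and filters a sentence with it. The base list is a tiny illustrative sample, and `domain_stop_words` and `remove_stop_words` are hypothetical names made up for this example, not part of any library.

```python
# Illustrative base list; in practice this could come from a library such as NLTK.
base_stop_words = {"the", "a", "an", "it", "by", "or"}

# Hypothetical domain-specific words assumed to carry little signal in a video catalogue.
domain_stop_words = {"video", "episode", "watch"}

# The custom list is simply the union of both sets.
custom_stop_words = base_stop_words | domain_stop_words


def remove_stop_words(text, stop_words):
    """Lowercase the text, split on whitespace, and drop any stop words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


tokens = remove_stop_words("Watch the video of a cooking episode", custom_stop_words)
print(tokens)
```

Which words belong in the domain-specific set depends entirely on your data, so treat the ones here as placeholders.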

Each language will have a different set of stop words. An English set of stop words would look like this:

english_stop_words = ["the","a","an","it","by","or",...]

while a Portuguese stop word list would look like this:

portuguese_stop_words = ["a", "o","um","uma","pelo", "pela","ou",...]

and a French set of stop words could be:

french_stop_words = ["le","la", "à","alors","ce",...]

So for each language you will need a set of stop words specific to that language, which is not necessarily a straight translation of another language's stop word set.
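One possible way to apply this to the TfidfVectorizer from the question is to merge the per-language lists and pass the result as stop_words. This is only a minimal sketch: the lists are the short illustrative samples from above (not complete), the `docs` descriptions are invented for the example, and merging assumes you do not know each item's language in advance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tiny illustrative samples; real stop word lists are much longer.
english_stop_words = ["the", "a", "an", "it", "by", "or"]
portuguese_stop_words = ["a", "o", "um", "uma", "pelo", "pela", "ou"]
french_stop_words = ["le", "la", "à", "alors", "ce"]

# Merge and de-duplicate; stop_words is passed as a plain list.
combined_stop_words = sorted(
    set(english_stop_words + portuguese_stop_words + french_stop_words)
)

# Hypothetical item descriptions in three languages.
docs = [
    "the detective story by the river",
    "a detective story or a mystery",
    "uma comédia romântica pela manhã",
    "le film alors la comédie",
]

tfidf = TfidfVectorizer(stop_words=combined_stop_words)
tfidf_matrix = tfidf.fit_transform(docs)

# Pairwise item-to-item similarity, shape (n_docs, n_docs).
cosine_sim = cosine_similarity(tfidf_matrix)

# Terms that survived stop word removal.
vocabulary = set(tfidf.vocabulary_)
print(sorted(vocabulary))
```

A merged list is a trade-off: a word that is a stop word in one language may be meaningful in another, so if the language of each item is known, fitting one vectorizer per language with its own list may work better.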

Again, this is all relative to the purpose of your application. Stop words are used in the pre-processing step of your Natural Language Processing pipeline as a noise reduction step.

Here is a website that has a list of stop words for several languages.

Good luck :)

Upvotes: 0
