user1419243

Reputation: 1705

Word Frequency Feature Normalization

I am extracting features for a document. One of the features is the frequency of a word in the document. The problem is that the number of sentences in the training set and the test set is not necessarily the same, so I need to normalize the frequency in some way. One possibility that came to my mind was to divide the word's frequency by the number of sentences in the document, but my supervisor told me it is better to normalize it in a logarithmic way. I have no idea what that means. Can anyone help me?

Thanks in advance,

PS: I also saw this topic, but it didn't help me.

Upvotes: 2

Views: 5597

Answers (4)

Seyma Kalay

Reputation: 2863

TF-IDF helps to normalize. Check the results with the tf and tf-idf weightings:

dtm <- DocumentTermMatrix(corpus); dtm

<>
Non-/sparse entries: 27316/97548
Sparsity           : 78%
Maximal term length: 22
Weighting          : term frequency (tf)

dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf)); dtm

<>
Non-/sparse entries: 24052/100812
Sparsity           : 81%
Maximal term length: 22
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

Upvotes: -1

Marcos Falcirolli

Reputation: 11

Yes, there is a logarithmic way. It's called TF-IDF.

TF-IDF is the product of the term frequency and the inverse document frequency.

TF-IDF = (number of times the word appears in the present document ÷ total number of words in the present document) × log(total number of documents in the collection ÷ number of documents in the collection where the word appears)
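A minimal sketch of this formula in plain Python (the toy documents and the word "cat" are made up for illustration):

```python
import math

def tf_idf(word, doc, collection):
    # Term frequency: occurrences of the word divided by the document length.
    tf = doc.count(word) / len(doc)
    # Inverse document frequency: log of (collection size / documents containing the word).
    n_containing = sum(1 for d in collection if word in d)
    idf = math.log(len(collection) / n_containing)
    return tf * idf

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran", "home"],
]
# "cat" appears once in docs[0] (length 3) and in 2 of the 3 documents.
score = tf_idf("cat", docs[0], docs)
```

Note that a word appearing in every document gets idf = log(1) = 0, so ubiquitous words like "the" are zeroed out.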

If you use Python, there is a nice library called Gensim that contains the algorithm, but your data object must be a Dictionary from gensim.corpora.

You can find an example here: https://radimrehurek.com/gensim/models/tfidfmodel.html

Upvotes: 1

Tomer Levinboim

Reputation: 1012

"Normalize it in a logarithmic way" probably simply means replacing the frequency feature with log(frequency).

One reason why taking the log might be useful is the Zipfian nature of word occurrences.
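A small sketch of what that replacement looks like in Python; the counts are made-up examples. Using log(1 + count) rather than log(count) is a common variant that keeps zero counts well-defined:

```python
import math

counts = {"the": 120, "cat": 7, "zipf": 1, "unseen": 0}

# log1p(x) = log(1 + x): zero counts stay 0, large counts are compressed,
# which tames the heavy-tailed (Zipfian) distribution of word frequencies.
log_features = {w: math.log1p(c) for w, c in counts.items()}
```

The compression is the point: the raw gap between 120 and 7 (roughly 17x) shrinks to about a 2x gap on the log scale.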

Upvotes: 2

CAFEBABE

Reputation: 4101

The first question to ask is: what algorithm are you using subsequently? For many algorithms it is sufficient to normalize the bag-of-words vector so that it sums to one, or so that some other norm equals one.

Instead of normalizing by the number of sentences, however, you should normalize by the total number of words in the document. Your test corpus might have longer sentences, for example.
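In Python, that normalization is just dividing each count by the document's total word count, so the resulting vector sums to one (counts here are an invented example):

```python
counts = {"the": 4, "cat": 2, "sat": 2}
total = sum(counts.values())

# Relative frequencies: comparable across documents of different lengths,
# because the vector always sums to one.
rel_freq = {w: c / total for w, c in counts.items()}
```

This makes a 100-word document and a 10,000-word document directly comparable, which dividing by sentence count does not guarantee.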

I assume your supervisor's recommendation means that you report not the raw word counts but the logarithm of the counts. In addition, I would suggest looking into the TF-IDF measure in general; it is, in my opinion, more common in text mining.

Upvotes: 2
