Mulone

Reputation: 3663

How to extract semantic relatedness from a text corpus

The goal is to assess semantic relatedness between terms in a large text corpus, e.g. 'police' and 'crime' should have a stronger semantic relatedness than 'police' and 'mountain' as they tend to co-occur in the same context.

The simplest approach I've read about consists of extracting TF-IDF information from the corpus.

A lot of people use Latent Semantic Analysis to find semantic correlations.

I've come across the Lucene search engine: http://lucene.apache.org/

Do you think it is suitable for extracting TF-IDF?

What would you recommend to do what I'm trying to do, both in terms of technique and software tools (with a preference for Java)?

Thanks in advance!

Mulone

Upvotes: 1

Views: 1496

Answers (2)

yura

Reputation: 14645

This is very easy if you have a Lucene index. For example, to get a correlation you can use the simple formula count(term1 AND term2) / (count(term1) * count(term2)), where count is the number of hits returned by the corresponding search. You can also easily calculate other semantic metrics such as chi^2 or information gain. All you need is to take the formula and express it in terms of hit counts from queries.
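A minimal sketch of the count-based formula from this answer, in plain Java. It assumes a tiny in-memory corpus standing in for a Lucene index, so count here is a document count rather than a real search hit count. The class name CooccurrenceScore, its methods, and the sample sentences are hypothetical and only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CooccurrenceScore {

    // Number of documents that contain every one of the given terms
    // (a stand-in for the hit count of an AND query against an index).
    static long count(List<Set<String>> docs, String... terms) {
        return docs.stream()
                   .filter(d -> d.containsAll(Arrays.asList(terms)))
                   .count();
    }

    // count(t1 AND t2) / (count(t1) * count(t2)); 0 if either term never occurs.
    static double score(List<Set<String>> docs, String t1, String t2) {
        long c1 = count(docs, t1);
        long c2 = count(docs, t2);
        if (c1 == 0 || c2 == 0) {
            return 0.0;
        }
        return (double) count(docs, t1, t2) / ((double) c1 * c2);
    }

    // Lowercase the text and split on non-letters to get the set of terms.
    static Set<String> tokenize(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("[^a-z]+")));
    }

    public static void main(String[] args) {
        // Hypothetical toy corpus, one entry per document.
        List<Set<String>> docs = List.of(
            tokenize("Police report a rise in crime"),
            tokenize("Crime scene secured by police"),
            tokenize("Police rescue hikers on the mountain"),
            tokenize("Mountain trail closed after snowfall"));

        System.out.println("police/crime:    " + score(docs, "police", "crime"));
        System.out.println("police/mountain: " + score(docs, "police", "mountain"));
    }
}
```

On this toy corpus 'police' and 'crime' score higher than 'police' and 'mountain'. Note that the raw ratio depends on corpus size, so scores are only comparable within the same corpus.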

Upvotes: 0

Xodarap

Reputation: 11849

Yes, Lucene can give you TF-IDF data. Carrot^2 is an example of a semantic extraction program built on Lucene. I mention it because, as a first step, it creates a correlation matrix. Of course, you can probably build this matrix yourself quite easily.

If you deal with a ton of data, you may want to use Mahout for the harder linear algebra parts.
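As a rough illustration of building such a term correlation matrix yourself, here is a sketch in plain Java. It assumes TF-IDF weights of the form tf * ln(N / df) and uses cosine similarity between term vectors as the "correlation"; this is one common choice, not necessarily what Carrot^2, Lucene, or Mahout do internally. The class TermCorrelation and all method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class TermCorrelation {

    // Sorted vocabulary across all tokenized documents.
    static List<String> vocabulary(List<List<String>> docs) {
        TreeSet<String> vocab = new TreeSet<>();
        docs.forEach(vocab::addAll);
        return new ArrayList<>(vocab);
    }

    // weights[t][d] = tf(t, d) * ln(N / df(t)); one row per term, one column per document.
    static double[][] tfIdf(List<List<String>> docs, List<String> vocab) {
        int n = docs.size();
        double[][] w = new double[vocab.size()][n];
        for (int t = 0; t < vocab.size(); t++) {
            int df = 0;
            for (int d = 0; d < n; d++) {
                for (String tok : docs.get(d)) {
                    if (tok.equals(vocab.get(t))) {
                        w[t][d]++;              // raw term frequency
                    }
                }
                if (w[t][d] > 0) {
                    df++;                       // document frequency
                }
            }
            double idf = Math.log((double) n / df);
            for (int d = 0; d < n; d++) {
                w[t][d] *= idf;
            }
        }
        return w;
    }

    // Cosine similarity; defined as 0 when either vector is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    // Dense term-by-term similarity matrix (quadratic in vocabulary size).
    static double[][] correlationMatrix(double[][] w) {
        int m = w.length;
        double[][] sim = new double[m][m];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < m; j++) {
                sim[i][j] = cosine(w[i], w[j]);
            }
        }
        return sim;
    }
}
```

A dense matrix like this is only practical for small vocabularies, which is where a sparse representation or a library such as Mahout becomes attractive. Also note that a term occurring in every document gets an IDF of zero under this weighting and therefore correlates with nothing.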

Upvotes: 0
