Taki

Reputation: 45

Is there any method for finding similarity between two corpora?

I would like to measure the similarity of two corpora. The similarity checks I have tried so far are the following:

Regarding Spearman's rank correlation coefficient, the code is as follows:

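def Spearman_rank_correlation_coefficient(another_word_freq_dict):
    num = 5120
    # Top `num` words of each corpus; assumes both dicts are ordered by descending frequency
    main_freq = list(main_word_freq_dict.keys())[:num]
    df_freq = list(another_word_freq_dict.keys())[:num]
    spearmen = []
    for i, word in enumerate(main_freq):
        i = i + 1  # 1-based rank in the main corpus
        try:
            j = df_freq.index(word) + 1  # 1-based rank in the other corpus
            spearmen.append((i - j) ** 2)
        except ValueError:
            j = num + 1  # word absent from the other corpus's top-`num` list
            spearmen.append((i - j) ** 2)
    val = sum(spearmen)
    return 1 - (6 * val) / (num**3 - num)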

Here, I took the 5120 most frequent words from both the main corpus and the other corpus. My first question concerns the line below except ValueError: I assign rank 5121 to any word that is not found in the other corpus's top-5120 frequency list. Is this the right way to handle a word from the main corpus that is missing from the other corpus when computing Spearman's rank correlation coefficient?

Regarding the chi-squared test for goodness of fit, I coded the following:

def chi2_test(another_word_freq_dict):
    num_words = 3000
    N1 = sum(main_word_freq_dict.values())
    # Relative frequencies in the main corpus, used as the expected values
    main_ = {key: val / N1 for key, val in main_word_freq_dict.items()}
    main_dict = {k: main_[k] for k in list(main_.keys())[:num_words]}
    another_dict = {k: another_word_freq_dict[k] for k in list(another_word_freq_dict.keys())[:num_words]}
    N_words = sum(another_dict.values())  # number of tokens in the other corpus (top words only)
    N_unique_words = len(another_dict)  # number of distinct words (types) kept from the other corpus
    chi = []
    for word, expected in main_dict.items():
        try:
            observed = (another_dict[word] + 1) / (N_words + N_unique_words)  # Laplace add-one smoothing
        except KeyError:  # the word from the main corpus is not in the other corpus
            observed = 1 / (N_words + N_unique_words)

        val = (expected - observed) ** 2 / expected
        chi.append(val)

    return sum(chi)

My second question: does my chi-squared test function make sense?

My third question: is there any method for calculating the similarity between two whole corpora, rather than between words or sentences?

Upvotes: 0

Views: 375

Answers (1)

Oliver Mason

Reputation: 2270

You first need to decide how you define the similarity between two texts. Should this be

  1. the same words used with similar frequencies
  2. similar words in a similar order
  3. general overlap in vocabulary
  4. very few words not shared between the texts
  5. words used in the same contexts
  6. overlaps of multi-word sequences (n-grams)
  7. ...

There are different ways to define similarity, and until you know what you're looking for there is no point coming up with metrics. "Similarity" between texts is not an absolute, objective concept.
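
As a rough illustration of how the choice of definition changes the metric, here is a minimal Python sketch of three of the options listed above (1, 3 and 6). The function names, the naive whitespace tokenizer and the two toy sentences are all assumptions made up for this example; they are not taken from the question's code, and a real comparison would need proper tokenization and much larger samples.

```python
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive, hypothetical tokenizer)."""
    return text.lower().split()


def jaccard_vocabulary_overlap(tokens_a, tokens_b):
    """Option 3: shared vocabulary divided by combined vocabulary."""
    vocab_a, vocab_b = set(tokens_a), set(tokens_b)
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0


def cosine_frequency_similarity(tokens_a, tokens_b):
    """Option 1: cosine of the angle between the two word-frequency vectors."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    # Counter returns 0 for words that are missing from the other corpus
    dot = sum(count * freq_b[word] for word, count in freq_a.items())
    norm_a = sqrt(sum(c * c for c in freq_a.values()))
    norm_b = sqrt(sum(c * c for c in freq_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ngram_overlap(tokens_a, tokens_b, n=2):
    """Option 6: Jaccard overlap of the sets of n-grams (bigrams by default)."""
    grams_a = set(zip(*(tokens_a[i:] for i in range(n))))
    grams_b = set(zip(*(tokens_b[i:] for i in range(n))))
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


# Toy "corpora", invented purely for illustration
corpus_a = tokenize("the cat sat on the mat")
corpus_b = tokenize("the dog sat on the log")

print(jaccard_vocabulary_overlap(corpus_a, corpus_b))
print(cosine_frequency_similarity(corpus_a, corpus_b))
print(ngram_overlap(corpus_a, corpus_b))
```

The three functions give different scores for the same pair of texts, which is the point: each one answers a different question about what "similar" means.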

Upvotes: 1
