Is there any method for finding similarity between two corpora?

Question

I would like to measure the similarity of two corpora. The similarity measures I have tried so far are the following:

Jaccard similarity
Dice's coefficient
Spearman's rank correlation coefficient
Chi2 test
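
For reference, the first two measures can be sketched on vocabulary sets as below. This is only a minimal illustration, assuming each corpus is represented by a dict mapping word to count; the toy dictionaries are hypothetical and stand in for the real frequency lists.

```python
def jaccard_similarity(vocab_a, vocab_b):
    """Jaccard similarity |A & B| / |A | B| over two collections of word types."""
    a, b = set(vocab_a), set(vocab_b)
    union = a | b
    # Two empty vocabularies are treated as identical by convention here.
    return len(a & b) / len(union) if union else 1.0


def dice_coefficient(vocab_a, vocab_b):
    """Dice's coefficient 2|A & B| / (|A| + |B|) over two collections of word types."""
    a, b = set(vocab_a), set(vocab_b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


# Hypothetical toy frequency dictionaries (word -> count), for illustration only.
toy_main = {"the": 10, "cat": 4, "sat": 2, "mat": 1}
toy_other = {"the": 8, "dog": 5, "sat": 3}

print(jaccard_similarity(toy_main, toy_other))
print(dice_coefficient(toy_main, toy_other))
```

Both functions look only at which words occur, not how often, so they ignore the frequency information that the rank-based and chi-squared measures use.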

Regarding Spearman's rank correlation coefficient, my code is as follows:

def Spearman_rank_correlation_coefficient(another_word_freq_dict):
    # Assumes both dicts are ordered from most to least frequent word.
    num = 5120
    main_freq = list(main_word_freq_dict.keys())[:num]
    df_freq = list(another_word_freq_dict.keys())[:num]
    squared_diffs = []
    for i, word in enumerate(main_freq, start=1):
        try:
            j = df_freq.index(word) + 1  # 1-based rank in the other corpus
        except ValueError:
            j = num + 1  # word is not in the other corpus's top-num list
        squared_diffs.append((i - j) ** 2)
    val = sum(squared_diffs)
    return 1 - (6 * val) / (num**3 - num)

Here, I took the 5120 most frequent words from both the main corpus and the other corpus. My first question concerns the except ValueError branch, where I assign 5121 as the rank of a word that is not found in the other corpus's top-5120 frequency list. Is this the right way to handle a word from the main corpus that is missing from the other corpus when computing Spearman's rank correlation coefficient?
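
As a point of comparison, the closed-form expression 1 - 6*sum(d^2)/(n^3 - n) is exact only when both rankings are permutations of the same n items with no ties. One possible workaround, which may or may not suit a given application, is to restrict the comparison to words that appear in both top lists and re-rank them within that shared set. The sketch below assumes each dict maps word to count; the function name and its parameters are hypothetical.

```python
def spearman_on_shared_words(freq_a, freq_b, top_n=5120):
    """Spearman's rho computed only over words present in both top-n lists.

    Words are re-ranked inside the shared set so that both rankings are
    permutations of 1..n. Equal counts are ordered by the sort's stable
    order rather than given averaged ranks, which is a simplification.
    """
    top_a = sorted(freq_a, key=freq_a.get, reverse=True)[:top_n]
    top_b = sorted(freq_b, key=freq_b.get, reverse=True)[:top_n]
    shared = set(top_a) & set(top_b)
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared words to compute a correlation")
    # Re-rank the shared words by their order within each top list.
    rank_a = {w: r for r, w in enumerate((w for w in top_a if w in shared), start=1)}
    rank_b = {w: r for r, w in enumerate((w for w in top_b if w in shared), start=1)}
    d_squared = sum((rank_a[w] - rank_b[w]) ** 2 for w in shared)
    return 1 - 6 * d_squared / (n**3 - n)
```

This variant discards the information carried by non-overlapping words, so it is worth reporting the size of the shared set alongside the coefficient.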

Regarding the chi-squared goodness-of-fit test, I coded the following:

def chi2_test(another_word_freq_dict):
    # Assumes both dicts are ordered from most to least frequent word.
    num_words = 3000
    N1 = sum(main_word_freq_dict.values())
    # Relative frequencies in the main corpus, used as the expected proportions.
    main_ = {key: val / N1 for key, val in main_word_freq_dict.items()}
    main_dict = {k: main_[k] for k in list(main_)[:num_words]}
    another_dict = {k: another_word_freq_dict[k] for k in list(another_word_freq_dict)[:num_words]}
    N_words = sum(another_dict.values())  # number of tokens among the other corpus's top words
    N_unique_words = len(another_dict)  # number of distinct words (types) kept
    chi = []
    for word, expected in main_dict.items():
        try:
            observed = (another_dict[word] + 1) / (N_words + N_unique_words)  # Laplace add-one smoothing
        except KeyError:  # the word from the main corpus is not in the other corpus
            observed = 1 / (N_words + N_unique_words)

        val = (expected - observed) ** 2 / expected
        chi.append(val)

    return sum(chi)

My second question: does my implementation of the chi-squared test make sense?
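
For comparison, Pearson's goodness-of-fit statistic is conventionally computed on counts, sum((O - E)^2 / E), rather than on proportions. The sketch below is one way to write that under stated assumptions: each dict maps word to a positive count, the main corpus supplies the reference distribution, and only the main corpus's top words are compared. The function name and parameters are hypothetical.

```python
def chi2_statistic(main_freq, other_freq, top_n=3000):
    """Pearson chi-squared statistic over the main corpus's top-n words.

    Expected counts are the other corpus's total (over those words) split
    according to the main corpus's relative frequencies. Observed counts
    may be zero, so no smoothing is applied in this sketch.
    """
    top_words = sorted(main_freq, key=main_freq.get, reverse=True)[:top_n]
    main_total = sum(main_freq[w] for w in top_words)
    other_total = sum(other_freq.get(w, 0) for w in top_words)
    if main_total == 0 or other_total == 0:
        raise ValueError("both corpora need at least one token among the compared words")
    chi2 = 0.0
    for w in top_words:
        expected = other_total * main_freq[w] / main_total  # expected count
        observed = other_freq.get(w, 0)                     # observed count
        chi2 += (observed - expected) ** 2 / expected
    return chi2
```

Because the statistic is built from counts, it grows with corpus size, so values are comparable across corpus pairs only when the corpora are of similar size or the result is normalized.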

My third question: is there any method for calculating the similarity between two whole corpora, rather than between words or sentences?
