Reputation: 45
I would like to measure the similarity of two corpora. The similarity checks I have tried so far are the following.
For Spearman's rank correlation coefficient, the code is as follows:
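def Spearman_rank_correlation_coefficient(another_word_freq_dict):
    # main_word_freq_dict is a global dict defined elsewhere; both dicts are
    # assumed to be ordered from most to least frequent word.
    num = 5120
    main_freq = list(main_word_freq_dict.keys())[:num]
    df_freq = list(another_word_freq_dict.keys())[:num]
    spearmen = []
    for i, word in enumerate(main_freq):
        i = i + 1  # 1-based rank in the main corpus
        try:
            j = df_freq.index(word) + 1  # 1-based rank in the other corpus
            spearmen.append((i - j) ** 2)
        except ValueError:
            j = num + 1  # word is missing from the other corpus's top list
            spearmen.append((i - j) ** 2)
    val = sum(spearmen)
    return 1 - (6 * val) / (num**3 - num)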
Here, I took the 5120 most frequent words from both the main corpus and the other corpus. My first question concerns the line below except ValueError: I assign 5121 as the rank of a word that is not found in the other corpus's top-5120 frequency list. Is this the right way to handle a word from the main corpus that is not found in the other corpus when computing Spearman's rank correlation coefficient?
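For reference, the quantity the function above returns can be written out as below. This is only a restatement of the code, not a recommendation: i is a word's rank in the main corpus, j_i is its rank in the other corpus, and j_i is set to n + 1 when the word is missing. Note that this shortcut formula is normally stated for two rankings of the same n items without ties, which is not the situation once missing words all share the rank n + 1.

```latex
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n^{3} - n},
\qquad d_i = i - j_i,
\qquad n = 5120
```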
For the chi-squared goodness-of-fit test, I coded the following:
def chi2_test(another_word_freq_dict):
    num_words = 3000
    N1 = sum(main_word_freq_dict.values())  # total word count in the main corpus
    # relative frequency of every word in the main corpus
    main_ = {key: val / N1 for key, val in main_word_freq_dict.items()}
    main_dict = {k: main_[k] for k in list(main_.keys())[:num_words]}
    another_dict = {k: another_word_freq_dict[k] for k in list(another_word_freq_dict.keys())[:num_words]}
    N_words = sum(another_dict.values())  # total count of the kept words in the other corpus
    N_unique_words = len(another_dict)  # number of unique words (types) kept from the other corpus
    chi = []
    for word, expected in main_dict.items():
        try:
            observed = (another_dict[word] + 1) / (N_words + N_unique_words)  # Laplace add-one smoothing
        except KeyError:  # the word from the main corpus is not in the other corpus
            observed = 1 / (N_words + N_unique_words)
        val = (expected - observed) ** 2 / expected
        chi.append(val)
    return sum(chi)
My second question: does my chi-squared test function make sense?
My third question: is there any method for calculating the similarity between two whole corpora, rather than between words or sentences?
Upvotes: 0
Views: 375
Reputation: 2270
You first need to decide how you define the similarity between two texts. Should this be
There are different ways to define similarity, and until you know what you're looking for there is no point coming up with metrics. "Similarity" between texts is not an absolute, objective concept.
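To illustrate how much the answer depends on the definition, here is a minimal sketch comparing two common notions of similarity on the same pair of toy corpora. The function names, the toy sentences, and the choice of Jaccard overlap and cosine similarity are my own assumptions for illustration; they are not the only options, and neither is "the" correct metric.

```python
import math
from collections import Counter


def jaccard_similarity(freq_a, freq_b):
    """Vocabulary overlap: shared word types / all word types (counts ignored)."""
    vocab_a, vocab_b = set(freq_a), set(freq_b)
    union = vocab_a | vocab_b
    if not union:
        return 0.0
    return len(vocab_a & vocab_b) / len(union)


def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between the two word-count vectors."""
    dot = sum(count * freq_b.get(word, 0) for word, count in freq_a.items())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy corpora, made up purely for illustration.
corpus_a = Counter("the cat sat on the mat the end".split())
corpus_b = Counter("the the the the dog ran".split())

jaccard = jaccard_similarity(corpus_a, corpus_b)
cosine = cosine_similarity(corpus_a, corpus_b)
print(f"Jaccard: {jaccard:.3f}, cosine: {cosine:.3f}")
```

On these toy inputs the vocabulary overlap is low (0.125, since only one of eight word types is shared), while the cosine similarity is fairly high (about 0.76, because the shared word dominates both count vectors). The same two texts look dissimilar under one definition and similar under the other.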
Upvotes: 1