Reputation: 329
Hey, so I have the following hurdle in my data analysis.
I have two frequency lists in two separate text files that look like this:
list1.txt
325 de
309 het
308 is
289 een
258 ik
208 rt
207 op
192 :
189 van
186 met
178 echt
167 en
160 in
150 dat
list2.txt
528 het
471 ik
466 een
445 de
426 is
350 dat
308 niet
273 van
239 en
227 wat
215 die
199 je
193 met
188 op
180 in
166 te
155 voor
OPTION 1: I am looking for a way, preferably in Python, to compute the following formula on this data. This is the formula I am trying to implement:
Pm(w) = relative frequency of word/token 'w' in list1
Pv(w) = relative frequency of word/token 'w' in list2
variance = sqrt (Pm(w) / Nm + Pv(w) / Nv)
t = ( Pm(w) - Pv(w)) / variance
Could somebody help me write a program/function that does this for me? That is, it takes both text files as input and produces a t value for each word/token. I'm quite lost, and this seems to be taking me forever.
Output: a new document with t-test values and words.
OPTION 2: I am also looking for a way to produce a ratio for each word.
Input:(list1.txt and list2.txt)
Output: (list1-ratio.txt)
325 de 445 de 0.7:1
289 een 466 een 0.6:1
Output: (list2-ratio.txt)
445 de 325 de 1.3:1
466 een 289 een 1.6:1
Is there anyone who can help me with this? The best case would be to use both options, so I can compare the data. This isn't homework; I'm working on sentiment analysis.
Thanks
Upvotes: 1
Views: 547
Reputation: 12755
Here is an example using ttest_rel from scipy.stats. This performs a t-test for related data from two samples. In order to do such a test, assume a count of 0 for all words that are not in a list (e.g., "die" is in list2, but not in list1, so the count of "die" in list1 is 0).
from scipy.stats import ttest_rel

def input_file_to_dict(f):
    # each input line is "<count> <word>"
    return dict((word, int(count)) for count, word in (line.split() for line in f))

with open("16892486/list1.txt") as f:
    word_counts1 = input_file_to_dict(f)
with open("16892486/list2.txt") as f:
    word_counts2 = input_file_to_dict(f)

# all words that occur in either list; a word missing from a list counts as 0
all_words = sorted(set(word_counts1) | set(word_counts2))
counts1 = [word_counts1.get(w, 0) for w in all_words]
counts2 = [word_counts2.get(w, 0) for w in all_words]
t, p = ttest_rel(counts1, counts2)
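Note that ttest_rel returns a single statistic for the two lists as a whole, whereas the formula in the question gives one t value per word. A minimal sketch of that per-word formula follows. It assumes that Nm and Nv are the total token counts of each list (the question does not define them) and that a word missing from a list has relative frequency 0. The names relative_freqs and per_word_t are made up for this illustration.

```python
from math import sqrt

def relative_freqs(counts):
    """Return ({word: count / N}, N), where N is the sum of all counts."""
    total = sum(counts.values())
    return dict((w, float(c) / total) for w, c in counts.items()), total

def per_word_t(counts_m, counts_v):
    """Per-word t value: (Pm(w) - Pv(w)) / sqrt(Pm(w)/Nm + Pv(w)/Nv)."""
    p_m, n_m = relative_freqs(counts_m)
    p_v, n_v = relative_freqs(counts_v)
    result = {}
    for w in set(counts_m) | set(counts_v):
        pm = p_m.get(w, 0.0)  # assumption: absent word has frequency 0
        pv = p_v.get(w, 0.0)
        denom = sqrt(pm / n_m + pv / n_v)
        # denom is 0 only if the word has count 0 in both lists
        result[w] = (pm - pv) / denom if denom > 0 else 0.0
    return result

# Tiny worked example: Pm(de) = 0.75, Pv(de) = 0.25, Nm = Nv = 4,
# so the denominator is sqrt(0.75/4 + 0.25/4) = 0.5 and t = 0.5 / 0.5.
example = per_word_t({"de": 3, "het": 1}, {"de": 1, "het": 3})
```

With the dictionaries from above, per_word_t(word_counts1, word_counts2) would give the t values for the two files. If the lists are only the top of a larger frequency table, the totals computed here are too small, and the real corpus sizes should be used for Nm and Nv instead.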
For OPTION 2, we can then simply calculate the ratios we need and write them to a file:
with open("16892486/list1-ratio.txt", "w") as f_out:
    for word, count1, count2 in zip(all_words, counts1, counts2):
        # ratio of list1 to list2; infinite if the word is absent from list2
        ratio = float(count1) / count2 if count2 > 0 else float("inf")
        f_out.write("%d %s %d %s %.2f:1\n" % (count1, word, count2, word, ratio))
Output in the file is then:
192 : 0 : inf:1
150 dat 350 dat 0.43:1
325 de 445 de 0.73:1
0 die 215 die 0.00:1
178 echt 0 echt inf:1
289 een 466 een 0.62:1
167 en 239 en 0.70:1
309 het 528 het 0.59:1
258 ik 471 ik 0.55:1
160 in 180 in 0.89:1
308 is 426 is 0.72:1
0 je 199 je 0.00:1
186 met 193 met 0.96:1
0 niet 308 niet 0.00:1
207 op 188 op 1.10:1
208 rt 0 rt inf:1
0 te 166 te 0.00:1
189 van 273 van 0.69:1
0 voor 155 voor 0.00:1
0 wat 227 wat 0.00:1
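The question also asks for list2-ratio.txt, with the ratio taken the other way around. One way to cover both directions is to wrap the logic in a function and swap the arguments. This is only a sketch: ratio_lines and write_ratio_file are names invented here, and the output format simply mirrors the file shown above.

```python
def ratio_lines(counts_a, counts_b):
    """Build 'count_a word count_b word ratio:1' lines, ratio = count_a / count_b."""
    lines = []
    for word in sorted(set(counts_a) | set(counts_b)):
        a = counts_a.get(word, 0)  # a word missing from a list counts as 0
        b = counts_b.get(word, 0)
        ratio = float(a) / b if b > 0 else float("inf")
        lines.append("%d %s %d %s %.2f:1" % (a, word, b, word, ratio))
    return lines

def write_ratio_file(path, counts_a, counts_b):
    """Write the ratio lines to path, one per line."""
    with open(path, "w") as f_out:
        f_out.write("\n".join(ratio_lines(counts_a, counts_b)) + "\n")

# Small illustration with made-up subsets of the two lists:
demo = ratio_lines({"de": 445, "die": 215}, {"de": 325, "rt": 208})
```

Calling write_ratio_file("16892486/list2-ratio.txt", word_counts2, word_counts1) would then produce the second file, and swapping the two dictionaries reproduces the first one.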
Notes:
Upvotes: 1