Reputation: 459
I want to play around with text statistics, comparing texts pairwise by looking at relative frequencies of words in them (typically by computing the sum of absolute values of differences). This is O(n^2) in the number of texts, so precomputation within each text is ok. My question is about how to represent such statistics. I have tried two ways:
1. Vector (T.Text, Double), sorted by hand during precomputation. Given two such vectors, the sum is computed by a recursive function, a kind of zip that keeps track of how the first elements of the pairs line up, followed by a fold.
2. Map T.Text Double, cooking up the same thing with mergeWithKey (\k x y -> Just (abs (x - y))) id id and a foldl' (+) 0 on top.
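Concretely, the Map version amounts to something like this (a minimal sketch; the function name is made up for this question):

    import qualified Data.Map.Strict as M
    import qualified Data.Text as T

    -- Sum of absolute differences of word frequencies.  A word missing from one
    -- text counts with frequency 0 there, hence the two `id`s passed to
    -- mergeWithKey, which keep unmatched entries as they are.
    mapDistance :: M.Map T.Text Double -> M.Map T.Text Double -> Double
    mapDistance a b =
      M.foldl' (+) 0 (M.mergeWithKey (\_ x y -> Just (abs (x - y))) id id a b)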
The second way is much more expressive, because a Map is essentially what text statistics really are, and the code is much shorter. But on the other hand the Vector is about 3 times faster, at the cost of a lot of verbosity, and somehow it feels wrong, like a naive implementation of a Map. Of course it misses all the fancy insert / update / whatever, but I don't need that.
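For comparison, the Vector version boils down to a merge-style walk roughly like the following (a sketch with made-up names, not the exact code):

    {-# LANGUAGE BangPatterns #-}

    import qualified Data.Text as T
    import qualified Data.Vector as V

    -- Linear walk over two vectors sorted by word.  A word present in only one
    -- text contributes its full frequency to the distance.
    vecDistance :: V.Vector (T.Text, Double) -> V.Vector (T.Text, Double) -> Double
    vecDistance xs ys = go 0 0 0
      where
        go !acc i j
          | i >= V.length xs = acc + V.sum (V.map snd (V.drop j ys))
          | j >= V.length ys = acc + V.sum (V.map snd (V.drop i xs))
          | otherwise =
              let (wx, fx) = xs V.! i
                  (wy, fy) = ys V.! j
              in case compare wx wy of
                   LT -> go (acc + fx) (i + 1) j
                   GT -> go (acc + fy) i (j + 1)
                   EQ -> go (acc + abs (fx - fy)) (i + 1) (j + 1)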
Am I missing something here, like a third data structure that would be better for the task?
Upvotes: 2
Views: 165
Reputation: 89043
Suppose you have two documents, both with O(m) words. Then both of your implementations take O(m log m) time to compare the documents.
Vector version:
- Sorting document 1 into a Vector (Text, Double) ~ O(m log m)
- Sorting document 2 into a Vector (Text, Double) ~ O(m log m)
- Stepping through vectors 1 and 2 ~ O(m)
- Total: O(m log m)

Map version:
- Storing document 1 in a Map Text Double ~ O(m log m)
- Storing document 2 in a Map Text Double ~ O(m log m)
- Stepping through maps 1 and 2 ~ O(m log m)
- Total: O(m log m)
So your solutions are asymptotically equivalent, but this doesn't mean that both should have the same runtime. Testing against real data to see which has the smaller coefficient is the work of profiling, and is entirely appropriate at this point. The Vector solution may be less elegant, but it's thoroughly believable that it's more efficient.
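To put numbers on that, a quick benchmark along the lines of the following sketch (using criterion, with synthetic data and the two distance functions sketched in the question) will show which coefficient is smaller on your corpus:

    import Criterion.Main (bench, defaultMain, whnf)
    import qualified Data.Map.Strict as M
    import qualified Data.Text as T
    import qualified Data.Vector as V

    -- Toy benchmark comparing the two distance functions on synthetic data;
    -- swap in your real precomputed statistics.  For careful measurements you
    -- would normally force the inputs up front with criterion's `env`.
    main :: IO ()
    main = defaultMain
      [ bench "Map distance"    $ whnf (mapDistance statsA) statsB
      , bench "Vector distance" $ whnf (vecDistance vecA) vecB
      ]
      where
        ws     = map (T.pack . show) [1 :: Int .. 10000]
        statsA = M.fromList (zip ws [1 ..])
        statsB = M.fromList (zip ws [2 ..])
        vecA   = V.fromList (M.toAscList statsA)
        vecB   = V.fromList (M.toAscList statsB)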
After this point you could continue to optimize your run time by accepting approximations:
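One cheap approximation, purely as an illustration, is to keep only each text's k most frequent words before comparing (k is a tuning knob, not anything prescribed above):

    import qualified Data.List as L
    import qualified Data.Map.Strict as M
    import qualified Data.Text as T

    -- Illustrative approximation: truncate a text's statistics to its k most
    -- frequent words, trading accuracy for a smaller structure to merge.
    topK :: Int -> M.Map T.Text Double -> M.Map T.Text Double
    topK k = M.fromList . take k . L.sortOn (negate . snd) . M.toList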
Upvotes: 2