Vincent Beffara
Vincent Beffara

Reputation: 459

Performance : computing in Haskell using Vector (a,b) versus Map a b

I want to play around with text statistics, comparing texts pairwise by looking at relative frequencies of words in them (typically by computing the sum of absolute values of differences). This is O(n^2) in the number of texts, so precomputation within each text is ok. My question is about how to represent such statistics. I have tried two ways:

The second way is much more expressive, because a Map is essentially what text statistics really are, and the code is much shorter. But on the other hand the Vector is about 3 times faster, at the cost of a lot of verbosity, and somehow it feels wrong, like a naive implementation of a Map. Of course it misses all the fancy insert / update / whatever, but I don't need that.

Am I missing something here, like a third data structure that would be better for the task?

Upvotes: 2

Views: 165

Answers (1)

rampion
rampion

Reputation: 89043

Suppose you have two documents, both with O(m) words. Then both your implementations take O(m log m) time to compare the documents.

Sorting document 1 into a Vector (Text, Double) ~ O(m log m)
Sorting document 2 into a Vector (Text, Double) ~ O(m log m)
stepping through vector 1 and 2 ~ O(m)
total: O( m log m )

Storing document 1 in a Map Text Double ~ O(m log m)
Storing document 2 in a Map Text Double ~ O(m log m)
stepping through map 1 and map 2 ~ O(m log m)
total: O( m log m )

So your solutions are asymptotically equivalent, but this doesn't mean that both should have the same runtime. Testing against real data to see which has the smaller coefficient is the work of profiling, and is entirely appropriate at this point. The Vector solution may be less elegant, but it's thoroughly believable that it's more efficient.

After this point you could continue to optimize your run time by accepting approximations:

  • choose a set of words of size p you care about, p < m, and just examine the word counts over that domain for a runtime of O(p log p)
  • hash individual words to integer indicies, and ignore collisions for a runtime of O(m) (assuming constant time word hashing)

Upvotes: 2

Related Questions