Reputation: 48839
Clojure has a function sorted-set
which creates a PersistentTreeSet
object. As the name implies, sorted-set
creates a sorted collection of unique objects.
When are sorted sets useful? When is it better to use sorted-set
than sort
and distinct
?
=> (apply sorted-set [2 2 1 1 3 3])
#{1 2 3}
=> (sort (distinct [2 2 1 1 3 3]))
(1 2 3)
Upvotes: 5
Views: 1029
Reputation: 84369
Sorted sets are useful when you need set semantics – fast contains?
, conj
and disj
(= element removals), as explained by Leon – and traversals in a well-defined order. In the case of the built-in sorted sets (and maps), ordered traversals are possible over the entire set (seq
, rseq
) and any "subrange" (subseq
, rsubseq
) between two keys, inclusive or exclusive.
If you're willing to reach for out-of-core collections, the Contrib library data.avl (of which I am the author and maintainer) offers a flavour of sorted sets and maps with additional functionality – nth
for access to set elements by rank, rank-of
for discovering the rank of an element in the set, nearest neighbour queries, and "subrange" and split-like operations that return completely functional subsets of the input collection (think subseq
returning a completely functional subset of the original, not just a seq, without holding on to any elements of the original not present in the subset for the purposes of GC). All of these operate in O(log n) time worst-case, just like the standard sorted set operations.
If you only need contains?
+ conj
+ disj
, you'll probably want to use hash sets instead, since they tend to deliver better performance for these operations. It is worth noting, however, that if you anticitpate adding inputs from a possibly malicious outside source to your sets, you may want to go with sorted sets even if you don't care about the order. This is because hash sets' performance degrades to O(n) in the presence of hash collisions (which an adversary could force, the hash function in use being deterministic and fixed in advance), whereas sorted sets' O(log n) is a hard guarantee.
If you only need to sort your input collection once and then traverse it in whole, or various prefixes/suffixes of it, repeatedly, then building up a sorted vector of unique items may indeed be the better option. A sorted set may still be preferable even for a traversal-only workload, though, if you need the subseq
/rsubseq
feature of starting at an arbitrary element of the collection ((subseq a-set >= 5)
= seq over those elements of a-set
which are >= 5 with respect to a-set
's ordering).
Upvotes: 6
Reputation: 9276
The difference between a sorted set and the result of calling sort
and distinct
is that the resulting type is a set.
This gives you O(log N) performance (think binary search) to check whether an element is in the collection (contains?
) or to add one (conj
), while on a list, returned by sort
and distinct
you'd get worse characteristics to achieve the same behavior by default.
Upvotes: 1
Reputation: 191953
Personally, I use sorted sets if I want an ordered data structure without duplicates as I add elements in. That being said, I start with an empty set rather than apply it to a list.
The time I would use sort and distinct is if I had any other data structure like a list that I want to order and remove duplicates.
Basically, applying a set gives you a new object with unique elements while distinct acts on the same list reference.
Upvotes: 1