Reputation: 10149
I am working on the problem below, grouping anagrams. My current solution sorts each word character by character, then maps words with the same sorted value into a dictionary.
I am wondering if there are any ideas with better time complexity. I am thinking of ways to avoid the sorting, such as hashing, but the hash would have to be independent of the order of the characters in a word.
Here are the problem and my code, written in Python 2.7.
Problem:
Given a list of words like [rats, star, arts, cie, ice], group the anagrams into buckets and output them: [rats, star, arts] [cie, ice]
Source code:
from collections import defaultdict

def group_anagram(anagrams):
    # Key each word by its letters in sorted order; anagrams share a key.
    result = defaultdict(list)
    for a in anagrams:
        result[''.join(sorted(a))].append(a)
    return result

if __name__ == "__main__":
    anagrams = ['rats', 'star', 'arts', 'cie', 'ice']
    print group_anagram(anagrams)
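For the sample input, the buckets in the desired output are just the dictionary's values (a minimal usage sketch; group order may vary, since Python 2.7 dicts are unordered):

groups = group_anagram(anagrams)
print groups.values()   # [['rats', 'star', 'arts'], ['cie', 'ice']]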
Upvotes: 3
Views: 764
Reputation: 52008
Your current method is probably best. To test things, I used your method, the method from the excellent answer of @bigballer, and a third method that uses a tuple of letter counts as the key. To stress-test the methods, I ran them on the massive (264,097-word) word list yawl, running each function 100 times and calculating the average time for each approach:
from collections import defaultdict
import timeit

def group_anagram1(anagrams):
    # Key: the word's letters in sorted order.
    result = defaultdict(list)
    for a in anagrams:
        result[''.join(sorted(a))].append(a)
    return result.values()

primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def group_anagram2(anagrams):
    # Key: the product of one prime per letter (order-independent).
    result = defaultdict(list)
    for a in anagrams:
        n = 1
        for c in a:
            n *= primes[ord(c) - ord('a')]
        result[n].append(a)
    return result.values()

def group_anagram3(anagrams):
    # Key: a tuple of 26 letter counts.
    result = defaultdict(list)
    for a in anagrams:
        counts = [0] * 26
        for c in a:
            counts[ord(c) - ord('a')] += 1
        result[tuple(counts)].append(a)
    return result.values()

with open("yawl.txt") as f:
    words = f.readlines()
words = [w.strip() for w in words]

print timeit.timeit("group_anagram1(words)", setup="from __main__ import group_anagram1, words", number=100) / 100.0
print timeit.timeit("group_anagram2(words)", setup="from __main__ import group_anagram2, words", number=100) / 100.0
print timeit.timeit("group_anagram3(words)", setup="from __main__ import group_anagram3, words", number=100) / 100.0
Output (on my machine YMMV):
0.486009083239
0.64333226691
0.797640375079
Really, all methods are pretty dang fast given the size of yawl, each taking less than a second to process over a quarter of a million words. Nevertheless, your original method is the clear winner. Furthermore, it isn't restricted to the Latin 'a' to 'z' alphabet. As to why it is best: the key is constructed directly by Python built-ins (which run optimized C code), while the other methods rely on interpreted Python code. It is hard to beat built-ins.
On Edit: I re-implemented the second method using this list of primes, sorted so that more frequent letters (in the English language) are assigned smaller primes:
primes = [5,71,37,29,2,53,59,19,11,83,79,31,43,13,7,67,97,23,17,3,41,73,47,89,61,101]
It shaves a fraction of a second off the time, but not nearly enough to make it faster than the first method.
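For reference, the table above is what you get by pairing the smallest primes with the letters in the familiar "etaoin shrdlu..." frequency order. A minimal sketch, assuming that ordering:

freq_order = 'etaoinshrdlcumwfgypbvkjxqz'   # letters by descending English frequency
small_primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43,
                47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
prime_for = dict(zip(freq_order, small_primes))
# Rebuilt in 'a'..'z' order, this reproduces the list above:
primes = [prime_for[c] for c in 'abcdefghijklmnopqrstuvwxyz']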
On Further Edit:
I reran the above code with the following tweak to the second method (as suggested by @bigballer):
primes = [5, 71, 37, 29, 2, 53, 59, 19, 11, 83, 79, 31, 43, 13, 7, 67, 97, 23, 17, 3, 41, 73, 47, 89, 61, 101]
primes = {c: p for c, p in zip('abcdefghijklmnopqrstuvwxyz', primes)}

def group_anagram2(anagrams):
    # Same prime-product key, but via a dict lookup instead of ord() arithmetic.
    result = defaultdict(list)
    for a in anagrams:
        n = 1
        for c in a:
            n *= primes[c]
        result[n].append(a)
    return result.values()
With this version, the first two methods drop to a virtual tie, with the prime-based method slightly faster in my somewhat limited tests (by about 8%). Nevertheless, I still think the first method is preferable, since it doesn't depend on a fixed alphabet.
Upvotes: 3
Reputation: 146
Prime factorisation is unique, and the order of multiplication doesn't matter. You could assign a = 2, b = 3, c = 5, d = 7, etc. Then dab = 7 * 2 * 3 = 42 = 3 * 2 * 7 = bad, so your hash would be 42.
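A minimal sketch of that hash (the table and function name are illustrative, assuming lowercase 'a'-'z' input):

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43,
          47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]   # first 26 primes

def prime_hash(word):
    # Multiply one prime per letter; multiplication order is irrelevant.
    n = 1
    for c in word:
        n *= PRIMES[ord(c) - ord('a')]
    return n

print prime_hash('dab'), prime_hash('bad')   # 42 42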
Another option is an efficient implementation of hash(frozenset(collections.Counter(word).items())).
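Used directly as a grouping key, that looks like the sketch below (function name illustrative; keying on the frozenset itself, rather than its hash, avoids collisions entirely):

from collections import Counter, defaultdict

def group_by_counts(words):
    # The multiset of (letter, count) pairs is order-independent
    # and identical exactly for anagrams.
    result = defaultdict(list)
    for w in words:
        result[frozenset(Counter(w).items())].append(w)
    return result.values()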
Edit: probably the fastest would be to use 26 bits, one per letter. For each character in the word, flip the bit corresponding to it. You may get some collisions, in which case you can dedupe on lookup.
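A sketch of that bit key (names illustrative). Anagrams always share a mask, but since flipping cancels out repeated letters, different letter multisets can share one too, hence the dedupe step:

def bit_key(word):
    # Flip one bit per letter; assumes lowercase 'a'-'z'.
    m = 0
    for c in word:
        m ^= 1 << (ord(c) - ord('a'))
    return m

print bit_key('rats') == bit_key('star')   # True: anagrams match
print bit_key('aab') == bit_key('b')       # True: a collision to dedupe on lookup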
Upvotes: 4