user1002288
user1002288

Reputation: 5040

How to find find anagrams among words, which are given in a file

How to find find anagrams among words, which are given in a file.

My solution:

Sort them and then find duplicates.

O(n mlgm). n: number of words, m : max size of the word

Any better solutions ?

thanks

Upvotes: 3

Views: 4798

Answers (5)

Soudipta Dutta
Soudipta Dutta

Reputation: 2142

# list of words
words = ["ROOPA","TABU","OOPAR","BUTA","BUAT" , "PAROO","Soudipta",
        "Kheyali Park", "Tollygaunge", "AROOP","Love","AOORP",
         "Protijayi","Paikpara","dipSouta","Shyambazaar",
        "jayiProti", "North Calcutta", "Sovabazaar"]

#Method 1
A = [''.join(sorted(word)) for word in words]

dict ={}

for indexofsamewords,samewords in enumerate(A):
    dict.setdefault(samewords, []).append(indexofsamewords)
    
print(dict)
#{'AOOPR': [0, 2, 5, 9, 11], 'ABTU': [1, 3, 4], 'Sadioptu': [6, 14], ' KPaaehiklry': [7], 'Taeggllnouy': [8], 'Leov': [10], 'Paiijorty': [12, 16], 'Paaaikpr': [13], 'Saaaabhmryz': [15], ' CNaachlortttu': [17], 'Saaaaborvz': [18]}

for index in dict.values(): 
    print( [words[i] for i in index ] )
    

The Output :

['ROOPA', 'OOPAR', 'PAROO', 'AROOP', 'AOORP']
['TABU', 'BUTA', 'BUAT']
['Soudipta', 'dipSouta']
['Kheyali Park']
['Tollygaunge']
['Love']
['Protijayi', 'jayiProti']
['Paikpara']
['Shyambazaar']
['North Calcutta']
['Sovabazaar']

Upvotes: 0

ACV
ACV

Reputation: 10560

This is a solution without sorting: I came up with a new solution I guess. It uses the Fundamental Theorem of Arithmetic. So the idea is to use an array of the first 26 prime numbers. Then for each letter in the input word we get the corresponding prime number A = 2, B = 3, C = 5, D = 7 … and then we calculate the product of our input word. Next we do this for each word in the dictionary and if a word matches our input word, then we add it to the resulting list. All anagrams will have the same signature because

Any integer greater than 1 is either a prime number, or can be written as a unique product of prime numbers (ignoring the order).

Here's the code. I convert the word to UPPERCASE and 65 is the position of A which corresponds to my first prime number:

private int[] PRIMES = new int[] { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31,
        37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103,
        107, 109, 113 };

This is the function:

 private long calculateProduct(char[] letters) {
    long result = 1L;
    for (char c : letters) {
        if (c < 65) {
            return -1;
        }
        int pos = c - 65;
        result *= PRIMES[pos];
    }
    return result;
}

Complete description is available here: Anagram on dev.vvirlan.com

Upvotes: 9

GeekSince1982
GeekSince1982

Reputation: 731

It's an old topic, but I'll post it in case someone stumbles upon this:

I've made a description of the process done in google spreadsheets (can be done in excel too). It's a very simple method.

https://i.sstatic.net/hS1Zr.jpg

Basically you take you list of string and disassemble each string into letters. You take each "letters package" and sort them alphabetically. Assemble back into words but letters are alphabetically sorted. Sort on that assembled "words" - all anagrams land next to each other. Make a simple formula to mark anagrams.

Upvotes: -1

Fred Foo
Fred Foo

Reputation: 363737

Better solution: assume that words have some small average length. Ask your local linguist for a reference if necessary. Then apply the algorithm you had in mind; if it's the one I have in mind, it will mathemagically have expected linear time performance, in the number of words.

Upvotes: -1

Adam Rosenfield
Adam Rosenfield

Reputation: 400454

Hash all of the words using a hash function that is invariant under permutations of a word, e.g. compute the frequency count of each letter and hash that array. Put these in a hash table and look for entries that hash to the same value (of course, you still have to test if those collisions are actual anagrams, due to the nature of hash tables).

That should run in O(n) time, assuming you choose a good hash function and your input set does not contain too many anagrams (in the worst case, if every word is an anagram of every other word, this runs in O(n2) time).

Upvotes: 6

Related Questions