Gonzo
Gonzo

Reputation: 137

Computational cost of making Anagrams from two strings using symmetric difference in python

Given two strings a, and b of lengths |a| and |b|, compute the number of elements that should be deleted in total such that a and b are anagrams of each other.

The answer is found by counting the number of elements not in the intersection of sets a and b.

What is the computational cost of doing such compuation using a.symmetric_difference(b)? documentation

Upvotes: 1

Views: 66

Answers (1)

Patrick87
Patrick87

Reputation: 28302

As pointed out in the comments, what you'd really need is a sort of symmetric difference for multisets. We can sketch out what an implementation of this could look like and see what the computational cost is.

A simple implementation for strings would be to create two frequency tables for the symbols in the strings, then construct a third frequency table for the symmetric difference.

Constructing the frequency table for each of the input strings is easy:

for each character in the string
    if the table contains an entry for the character
         increment the entry for the character
    otherwise
         create an entry for the character with frequency set to one

We may now assume we have our two frequency tables freq1 and freq2. To construct the symmetric difference we can loop through each table's keys and compare to the other table:

for each character in freq1.keys
    f1 = freq1[character]
    f2 = 0
    if freq2 has an entry for character then set f2 = freq2[character]
    if f1 >= f2 then symmdiff[character] = f1 - f2

for each character in freq2.keys
    f2 = freq2[character]
    f1 = 0
    if freq1 has an entry for character then set f1 = freq1[character]
    if f2 >= f1 then symmdiff[character] = f2 - f1

Our table symmdiff now contains enough information to answer your question: if we loop through once more and add up all the frequencies, that will tell us how many total characters must be deleted (though it does not tell us how many to delete from either table specifically, just the combined total; if we wanted to know from each table specifically, we could calculate two tables for each difference, rather than one for the symmetric difference)

sum = 0
for each character in symmdiff.keys
    sum = sum + symmdiff[character]

So, what is the computational complexity of this? The worst case is when each string has the same length and each consists of n distinct characters (so 2n distinct characters are used in total). Then, the time complexity is:

  1. two loops through the strings to build the maps, each of size proportional to n.
  2. two loops through the frequency tables and construction of a frequency table of size 2n.
  3. a loop through the frequency table of size 2n

Assuming O(1) table inserts/access and ignoring the true cost of adding numbers when computing the sum, this process is basically linear in both time and space. You can think of it as looping over the whole input (the pair of input strings) three times (6n iterations) in the worst case.

Upvotes: 1

Related Questions