Yesh

Reputation: 318

Is array preferred over set or map?

I recently interviewed with a company in the Bay Area (CA, USA). One of the questions was to simply find whether a string has repeated characters (I have simplified a lengthy question).

e.g.:
input: "qwerrty"
output: True

I used Python to code this.

I gave a solution that uses a set to track the elements encountered during the iteration.

However, the interviewer wanted me to use an array[255] that tracks the characters encountered.
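For concreteness, here is a minimal sketch of both approaches (reconstructed for this post, not my exact interview code; I use 256 slots to cover every byte value):

    def has_repeat_set(s):
        # Track characters seen so far in a set.
        seen = set()
        for ch in s:
            if ch in seen:
                return True
            seen.add(ch)
        return False

    def has_repeat_array(s):
        # One slot per possible byte value (0-255), all initially False.
        # Assumes the input only contains byte-range characters.
        seen = [False] * 256
        for ch in s:
            if seen[ord(ch)]:
                return True
            seen[ord(ch)] = True
        return False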

Although I was quite comfortable using either of them, my opinion was to use a set, simply because the array wastes space: we create arr[255] up front with all elements initialized to zero and then increment the value at each character's ASCII index by 1, regardless of how few distinct characters actually appear.

A set, on the other hand, would spend memory only on the elements visited.

Since he (kind of) argued for using an array over a set, I am curious to know whether he was technically correct. Is an array preferred over a set/map in this case? If so, why?

Upvotes: 4

Views: 581

Answers (2)

user1537085

Reputation: 404

I believe time-complexity vs. space-complexity analysis is the actual answer your interviewer was looking for. Space-wise, both cases are O(N). Time-wise, adding a character to a set is not truly O(1) because of hashing overhead, whereas incrementing a value in an array is O(1). So generally speaking, using the array will consume about the same amount of memory but take much less time.

Upvotes: 2

templatetypedef

Reputation: 373082

One thing to notice about this question is that if there are only C possible distinct characters that can be in the string, then for any string you get of length C+1 or greater you can automatically return that a duplicate exists without even looking at the string because there are too many characters for them to all be unique (this is the pigeonhole principle at work). This is important for thinking about the structure of this particular problem.
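As a quick sketch of that short-circuit (assuming C = 256 possible byte values; the set-based fallback is just for illustration):

    def has_duplicate(s, alphabet_size=256):
        # Pigeonhole principle: more characters than the alphabet has
        # distinct values forces at least one repeat, no scan needed.
        if len(s) > alphabet_size:
            return True
        # Fallback scan: any duplicate shrinks the set of characters.
        return len(set(s)) < len(s)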

Next, notice that you don't even need a bunch of counters. You can get away with just one bit per character, since as you iterate across the string you only need to know whether you've never seen a character (0) or have seen it before (1). If your word size is W, this means you need roughly C / W total machine words of storage space for the array-based solution.
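In Python, for example, a single arbitrary-precision int can stand in for those C / W machine words (a sketch of the bit-per-character idea, not of a fixed-width bit array):

    def has_duplicate_bits(s):
        # Bit i is set once the character with code point i has been seen.
        seen = 0
        for ch in s:
            bit = 1 << ord(ch)
            if seen & bit:
                return True
            seen |= bit
        return False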

Let's imagine that you're working with C = 256 (say, because each character is a one-byte value) on a machine with a 32-bit word size (W = 32). This means you need eight machine words to store the bit array, which is a negligible amount of storage space and can easily be initialized to 0.

Now, think about your set implementation. If you use a hash table, there will be some sort of internal array used to store everything. You also need space for information about the hash function, and usually you'd cache the size of the set somewhere. That's going to eat up something like three machine words just for the size and hash function info, which leaves you five words of space. If the hash table is implemented generically and each entry uses up one machine word, then your approach only saves space if the hash table holds four entries or fewer, which is unlikely to happen. If your hash table is optimized and stores char values directly, then you can store up to five words' worth of chars (20 chars) without any collisions, but if you tried to keep the load factor low you'd probably resize the table after you saw 10 or so chars.

So in short: unless you have a very short string, the hash table approach will probably use more memory, and the overhead of hashing will be high. The array approach is likely faster.

On the other hand, imagine that you're storing arbitrary Unicode characters in the string. Now C = 1,114,112 (thanks, Wikipedia), and even with a 64-bit word size you're talking about an array of 17,408 machine words to store one bit per possible character. That's a lot of storage space, and it's going to take a while to initialize. Now, if the strings you're getting as input are "reasonable" and not pathologically constructed, chances are you'll find a duplicate pretty early on in the string (if the string is totally random, then by the birthday paradox you'll need only about √(2C) characters, on average, before you get a duplicate), so building a hash table will likely require a lot less space. If the strings are pathologically constructed so that every character is unique, though, the constant-factor overhead of computing hash functions, resizing the hash table, etc. will likely make your approach slower than the array-based one, but that's an unusual use case.
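To put that √(2C) figure in context, a quick back-of-the-envelope check (assuming the full Unicode range):

    import math
    C = 1_114_112             # number of Unicode code points
    print(math.isqrt(2 * C))  # -> 1492

So a random Unicode string is expected to show a duplicate within roughly the first 1,500 characters.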

To summarize:

  • If the number of possible characters is small (think ASCII), the array-based approach is likely going to be a lot faster and more memory-efficient.

  • If the number of possible characters is large (think Unicode), the array-based approach is likely going to be slower and less memory-efficient on reasonable inputs, but for pathologically chosen inputs it may be faster than the hash-based approach.

Now, that said, you could argue that unless the code is run in a tight loop, anything other than "just use a set" makes the code hard to read for a minimal benefit to the overall program efficiency. For that reason, a reasonable answer would be "use the set unless there's a reason not to, and then switch to the array-based one only if the data supports it."

Upvotes: 3
