Reputation: 53
Hi I'm attempting to solve this interview question problem:
String Compression: Implement a method to perform basic string compression using the counts of repeated characters. For example, the string
"aabcccccaaa"
would become"a2b1c5a3"
. If the "compressed" string would not become smaller than the original string, your method should return the original string. You can assume the string has only uppercase and lowercase letters (a - z).
It would be easy to solve this problem by creating a new string or just printing out the result but I'm trying to come up with an in-place solution. Basically, I'm using string::erase()
and string::insert()
function. First, I count up the number of characters present and store it in int
variable count
, then if a number is being repeated, I simply erase it and insert the count
value in there. If a character is not being repeated, then I simply insert 1 since the default value of count
is set to 1. Here is the code:
string stringCompression(string A) {
int index = 0; char character = ''; int count = 1; string convertInt = ""; int initial_index = 0, length = A.length();
do {
character = A[index];
initial_index = index;
index++;
while (A[index] == character)
{
count++;
index++;
}
convertInt = to_string(count);
if (count == 1)
{
A.insert(index, convertInt);
index++;
}
else
{
A.erase((initial_index+1), (index-1));
A.insert(initial_index+1, convertInt);
index = initial_index + 2;
count = 1;
}
} while (A[index]!='\0');
return A;
}
int main()
{
cout << stringCompression("aabcccccaaa");
}
Output:
a2b1c5
The expected result would be "a2b1c5a3"
. But the last set of 3 a
's do not appear and I tried debugging it. By the values I get this code should work. But strangely, when I debug, the erase()
function automatically deletes the last 3 a
's even though the values of Index
and initial_index
appear to be correct. Can someone please explain why this is happening?
Upvotes: 2
Views: 1743
Reputation: 11116
Giving an algorithm that performs the conversion in-place without requiring superlinear time complexity is somewhat tricky, but hardly impossible:
Iterate over the string once to compute if it can be compressed via the given method. To do so, simple compute the length of the original string (if it is not statically known) and use your method to count how many characters each sequence would cost (e.g., 'abbcccccccccc' starts with the 'a's, of which there is one, so it will require 2 characters, then the 'b's of which there are 2, so 2 characters again, and finally the 'c's of which there are 10, so 3 characters - total 2+2+3=7 characters). If the original string has fewer bytes than the "compressed" string would require, return false and stop processing (for 'abbcccccccccc' we get 13 > 7, so we continue processing. For 'ababab' we get 6 < 12, so we abort).
Iterate over the string again (for a bit of improved performance this step can be merged with the previous iteration), and replace every sequence where a character repeats more than once with the compressed sequence. Copy the string forwards while doing so and compute the newly required length. For example, 'abbcccccccccc' becomes 'ab2c10XXXXXXX', where the 'X' chars represent free space (unless you explicitly overwrite this, it will remain filled with 'c's, but is not currently in use). Since every transition will require at most as much space as before, this round will never use space that you don't currently have.
If your length matches the length you have computed in step 1, you are done.
If your length is less than the length computed in step 1, there are some single character runs left in the string, which can expand. To fix this, run over the string in reverse, starting at the target length. Copy as long as you find a character that is not preceded by a digit. This character is a single-character-run, so add in the '1' before it. After this step, your string will be as long as computed in step 1 and you are done.
Upvotes: 0
Reputation: 73376
This is a very nasty issue: string::erase()
has several signatures. There's a form with iterator, where you provide the start and end iterator for the region to be erased. You think to use this one.
But in reality, you don't work with iterators but with positions. With positions, erase()
uses a start position and a length. So instead of deleting just the 4 d
between position 5 and 9 as you would expect, it will erase 9 characters starting at position 5. So basically, with your example, it erases the rest of the string.
Correct it simply with:
A.erase((initial_index+1), count-1);
So instead of a do...while(...);
loop which assumes there's always a first char, use a normal while(...)...
loop.
You ignored this requirement:
if the "compressed" string would not become smaller than the original string, your method should return the original string.
If you try your code with an uncompressible string, you get longer strings:
stringCompression("abcde") ---> "a1b1c1d1e1"
If you want to comply, you should not add count
for any non-repeating char.
Your if...else...
would be simplified into:
if (count > 1)
{
convertInt = to_string(count);
A.erase((initial_index+1), count-1);
A.insert(initial_index+1, convertInt);
index = initial_index + convertInt.length();
count = 1;
}
C++ strings are not C strings: you are not allowed to access one char after its end to check if it's null! By the way, it's not guaranteed that a c++ string ends with a null char. Moreover, a string could perfectly contain a null char that is not terminal.
So for the while loops, you always need to check that inde stays within bounds. In the end, you get:
string stringCompression(string A) {
int index = 0; char character = ' '; int count = 1; string convertInt = ""; int initial_index = 0, length = A.length();
while (index <A.length()) {
character = A[index];
initial_index = index;
index++;
while (index<=A.length() && A[index] == character)
{
count++;
index++;
}
if (count > 1)
{
convertInt = to_string(count);
A.erase((initial_index+1), count-1);
A.insert(initial_index+1, convertInt);
index = initial_index+convertInt.length()+1;
count = 1;
}
}
return A;
}
If you transform the string, always get the length dynamically, because the length you get at the beginning of your function is very quickly obsolete ;-)
Deleting and inserting will copy each of the trailing chars twice. Considering that you're no longer inflating strings, a better way would be to only delete what has to be deleted, and replace the remaining characters with the count. You'll double the performance.*
Upvotes: 2
Reputation: 473437
Algorithms for compressing a byte stream in-place exist, but those algorithms are designed with in-place compression in mind. You cannot take just any old algorithm and make it work in-place without additional allocations.
This algorithm in particular is non-functional for in-place work. And this can be easily demonstrated. Simply take a string of "abababab".
You read the first character and store it. You read the second character, compare against the first, and you see that it's different. So you now need to write "a1", which would overwrite the first "b". That would overwrite important data; you need to keep around enough data to reconstruct the original string if the compressed one is too long.
Now maybe you could store the character you just read, the "b" that's about to be overwritten. You need to remember it anyway, so that's fine.
However, on the next character you need to write "b1". This would overwrite both the next "a" and the next "b". So you'd need to store two characters.
This process repeats, each time increasing the number of characters you need to store, until you reach the end and have stored a number of characters half the length of the original string.
If you have to allocate a string half as long as the original just to keep track of extra characters, I would not call that an "in-place" algorithm. You're still allocating memory, which is what an in-place algorithm is supposed to avoid.
Upvotes: 2