Reputation: 35267
This question is about removing sequences from an array, not duplicates in the strict sense.
Consider this sequence of elements in an array;
[0] a
[1] a
[2] b
[3] c
[4] c
[5] a
[6] c
[7] d
[8] c
[9] d
In this example I want to obtain the following...
[0] a
[1] b
[2] c
[3] a
[4] c
[5] d
Notice that duplicate elements are retained but that sequences of the same element have been reduced to a single instance of that element.
Further, notice that when two lines repeat they should be reduced to one set (of two lines).
[0] c
[1] d
[2] c
[3] d
...reduces to...
[0] c
[1] d
I'm coding in C# but algorithms in any language appreciated.
Upvotes: 7
Views: 1079
Reputation: 47934
EDIT: made some changes and new suggestions
What about a sliding window...
REMOVE LENGTH 2: (no other length has other matches)
//the lower case letters are the matches
ABCBAbabaBBCbcbcbVbvBCbcbcAB
__ABCBABABABBCBCBCBVBVBCBCBCAB
REMOVE LENGTH 1 (duplicate characters):
//* denote that a string was removed to prevent continual contraction
//of the string, unless this is what you want.
ABCBA*BbC*V*BC*AB
_ABCBA*BBC*V*BC*AB
RESULT:
ABCBA*B*C*V*BC*AB == ABCBABCVBCAB
This is of course starting with length=2, increase it to L/2 and iterate down.
I'm also thinking of two other approaches:
Upvotes: 3
Reputation: 2201
Here's C# app i wrote that solves this problem.
takes
aabccacdcd
outputs
abcacd
Probably looks pretty messy, took me a bit to get my head around the dynamic pattern length bit.
class Program
{
private static List<string> values;
private const int MAX_PATTERN_LENGTH = 4;
static void Main(string[] args)
{
values = new List<string>();
values.AddRange(new string[] { "a", "b", "c", "c", "a", "c", "d", "c", "d" });
for (int i = MAX_PATTERN_LENGTH; i > 0; i--)
{
RemoveDuplicatesOfLength(i);
}
foreach (string s in values)
{
Console.WriteLine(s);
}
}
private static void RemoveDuplicatesOfLength(int dupeLength)
{
for (int i = 0; i < values.Count; i++)
{
if (i + dupeLength > values.Count)
break;
if (i + dupeLength + dupeLength > values.Count)
break;
var patternA = values.GetRange(i, dupeLength);
var patternB = values.GetRange(i + dupeLength, dupeLength);
bool isPattern = ComparePatterns(patternA, patternB);
if (isPattern)
{
values.RemoveRange(i, dupeLength);
}
}
}
private static bool ComparePatterns(List<string> pattern, List<string> candidate)
{
for (int i = 0; i < pattern.Count; i++)
{
if (pattern[i] != candidate[i])
return false;
}
return true;
}
}
fixed the initial values to match the questions values
Upvotes: 2
Reputation: 405765
I would dump them all into your favorite Set implementation.
EDIT: Now that I understand the question, your original solution looks like the best way to do this. Just loop through the array once, keeping an array of flags to mark which elements to keep, plus a counter to keep track to the size of the new array. Then loop through again to copy all the keepers to a new array.
Upvotes: 1
Reputation: 621
I agree that if you can just dump the strings into a Set, then that might be the easiest solution.
If you don't have access to a Set implementation for some reason, I would just sort the strings alphabetically and then go through once and remove the duplicates. How to sort them and remove duplicates from the list will depend on what language and environment you are running your code.
EDIT: Oh, ick.... I see based on your clarification that you expect that patterns might occur even over separate lines. My approach won't solve your problem. Sorry. Here is a question for you. If I had the following file.
a
a
b
c
c
a
a
b
c
c
Would you expect it to simplify to
a
b
c
Upvotes: 0