Ed Guiness
Ed Guiness

Reputation: 35267

Best way to reduce sequences in an array of strings

This question is about removing sequences from an array, not duplicates in the strict sense.

Consider this sequence of elements in an array;

[0] a
[1] a
[2] b
[3] c
[4] c
[5] a
[6] c
[7] d
[8] c
[9] d

In this example I want to obtain the following...

[0] a
[1] b
[2] c
[3] a
[4] c
[5] d

Notice that duplicate elements are retained but that sequences of the same element have been reduced to a single instance of that element.

Further, notice that when two lines repeat they should be reduced to one set (of two lines).

[0] c
[1] d
[2] c
[3] d

...reduces to...

[0] c
[1] d

I'm coding in C# but algorithms in any language appreciated.

Upvotes: 7

Views: 1079

Answers (4)

nlucaroni
nlucaroni

Reputation: 47934

EDIT: made some changes and new suggestions

What about a sliding window...

REMOVE LENGTH 2: (no other length has other matches)
//the lower case letters are the matches
ABCBAbabaBBCbcbcbVbvBCbcbcAB  
__ABCBABABABBCBCBCBVBVBCBCBCAB

REMOVE LENGTH 1 (duplicate characters):
//* denote that a string was removed to prevent continual contraction
//of the string, unless this is what you want.
ABCBA*BbC*V*BC*AB
_ABCBA*BBC*V*BC*AB

RESULT:
ABCBA*B*C*V*BC*AB == ABCBABCVBCAB

This is of course starting with length=2, increase it to L/2 and iterate down.

I'm also thinking of two other approaches:

  1. digraph - Set a stateful digraph with the data and iterate over it with the string, if a cycle is found you'll have a duplication. I'm not sure how easy it is check check for these cycles... possibly some dynamic programming, so it could be equivlent to method 2 below. I'm going to have to think about this one as well longer.
  2. distance matrix - using a levenstein distance matrix you might be able to detect duplication from diagonal movement (off the diagonal) with cost 0. This could indicate duplication of data. I will have to think about this more.

Upvotes: 3

sieben
sieben

Reputation: 2201

Here's C# app i wrote that solves this problem.

takes
aabccacdcd

outputs
abcacd

Probably looks pretty messy, took me a bit to get my head around the dynamic pattern length bit.

class Program
{
    private static List<string> values;
    private const int MAX_PATTERN_LENGTH = 4;

    static void Main(string[] args)
    {
        values = new List<string>();
        values.AddRange(new string[] { "a", "b", "c", "c", "a", "c", "d", "c", "d" });


        for (int i = MAX_PATTERN_LENGTH; i > 0; i--)
        {
            RemoveDuplicatesOfLength(i);
        }

        foreach (string s in values)
        {
            Console.WriteLine(s);
        }
    }

    private static void RemoveDuplicatesOfLength(int dupeLength)
    {
        for (int i = 0; i < values.Count; i++)
        {
            if (i + dupeLength > values.Count)
                break;

            if (i + dupeLength + dupeLength > values.Count)
                break;

            var patternA = values.GetRange(i, dupeLength);
            var patternB = values.GetRange(i + dupeLength, dupeLength);

            bool isPattern = ComparePatterns(patternA, patternB);

            if (isPattern)
            {
                values.RemoveRange(i, dupeLength);
            }
        }
    }

    private static bool ComparePatterns(List<string> pattern, List<string> candidate)
    {
        for (int i = 0; i < pattern.Count; i++)
        {
            if (pattern[i] != candidate[i])
                return false;
        }

        return true;
    }
}

fixed the initial values to match the questions values

Upvotes: 2

Bill the Lizard
Bill the Lizard

Reputation: 405765

I would dump them all into your favorite Set implementation.

EDIT: Now that I understand the question, your original solution looks like the best way to do this. Just loop through the array once, keeping an array of flags to mark which elements to keep, plus a counter to keep track to the size of the new array. Then loop through again to copy all the keepers to a new array.

Upvotes: 1

Terry
Terry

Reputation: 621

I agree that if you can just dump the strings into a Set, then that might be the easiest solution.

If you don't have access to a Set implementation for some reason, I would just sort the strings alphabetically and then go through once and remove the duplicates. How to sort them and remove duplicates from the list will depend on what language and environment you are running your code.

EDIT: Oh, ick.... I see based on your clarification that you expect that patterns might occur even over separate lines. My approach won't solve your problem. Sorry. Here is a question for you. If I had the following file.

a

a

b

c

c

a

a

b

c

c

Would you expect it to simplify to

a

b

c

Upvotes: 0

Related Questions