Reputation: 145

Efficiently determining the most common substring

Given a sequence of length N, the string inside the Seq is A, B, C, D

Input: Seq={ACCBADBAACCBACADAAADC...DBACDBACD}
Output: string appear most of the time

Is there any fastest algorithm to look for string appear most of the time? Which mean

example: let say AAA only appear one time in the Seq then let say BAA appear also one time and so on... after that found out ACCBA appear 2 times in the Seq which is the string appear most of the time, so the output is ACCBA.state the worst case complexity of the algorithm also...

This might have a lot of answers...such as brute force approch but that is quite slow...no need provide the exact code...the psedo code should be enough... Please help me to provide with some clues or reference info...i would like to learn...

Upvotes: 1

Answers (2)

paxdiablo

Reputation: 881153

I see two interpretations of your question. I'll cover what I think is the most likely first.

That is the case where you want to count the longest subsequences of a character to see which subsequence occurs the most. In other words, the string AAABBBBBAAA would give you {2,AAA} since there are two of those longest-subsequences and only one of BBBBB.

In order to do that, you could use:

dim seqcount['A'..'D',1..len(str)] = 0   # Array to hold counts.
lastch = str[0]                          # Last character processed.
count = 1                                # Count of last char processed.
maxseqcount = 0                          # Largest quantity to date.
maxseqchars = ""                         # Letters of that largest quantity.

# Process the end of a sequence.

def endseq (thisch,thiscount):
    # Increase quantity for letter/length combo.

    seqcount[thisch,thiscount] = seqcount[thisch,thiscount] + 1

    # Quantity same as current max, add letter to list (if not already there).

    if seqcount[thisch,thiscount] == maxseqcount:
        if not maxseqchars.contains (thisch):
            maxseqchars = maxseqchars + thisch

    # Quantity greater than current max, change max and use this letter.

    if seqcount[thisch,thiscount] > maxseqcount:
        maxseqcount = seqcount[thisch,thiscount]
        maxseqchars = thisch

def main:
    # Process every character (other than first) once.

    for pos = 1 to len(str) - 1:
        # Still in a sequence, add to length and restart loop.

        if str[pos] == lastch:
            count = count + 1
            continue

        # Letter change, process end of sequence.

        endseq (lastch, count)

        # Then store this new character and re-init count.

        lastch = str[pos]
        count = 1

    # Termination, we still have the last sequence to deal with.

    endseq (lastch, count)

This will give you O(n) storage and O(n) time since you have a constant possibility of different characters (A thru D) and you're only processing each character in the string once.

At the end of the processing, you can get what you want with {maxseqcount,maxseqchars} although maxseqchars is not necessarily a single letter, since your input string may be something like ABAB which would mean that both {2,A} and {2,B} are equally valid.

The second (although unlikely) possibility is that you don't have to use the longest subsequences of a single character in which case the most occurring sequence will be {1,whatever single character occurs the most in the string}.

It looks like, from your update that allows a sequence to not be all the same character, that this latter possibility may be the case (allowing arbitrary lengths of any characters, same or different).

If so then, you simply process in O(n) every character to see which single character occurs the most. Then it simply the length=1 sequence of that character.

For example, if your string was ACCBA, then {1,ACCBA} is not the solution. Rather, it's {2,A} ot {2,C}.

Upvotes: 2

Hot Licks

Reputation: 47699

If you mean you want to find the longest string of identical characters, I believe this can be done in linear time with a relatively simple algorithm. You'd basically keep track of the current longest sequence character, it's count, the last character viewed, and the number of times that character has been seen consecutively. Just scan and update the values.

Upvotes: 1

Efficiently determining the most common substring

Answers (2)

Related Questions