kjo

Reputation: 35301

What's the name of this programming pattern?

Often one needs to process a sequence of "chunks", which are read from a stream of "atoms", where each chunk consists of a variable number of atoms, and there is no way for the program to know that it has received a complete chunk until it reads the first atom of the next chunk (or the stream of atoms becomes exhausted).

A straightforward algorithm for doing this task would look like this:

LOOP FOREVER:
    SET x TO NEXT_ATOM
    IF DONE(x) OR START_OF_CHUNK(x):
        IF NOT EMPTY(accum):
            PROCESS(accum)
        END
        IF DONE(x):
            BREAK
        END
        RESET(accum)
    END
    ADD x TO accum
END
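(As an aside, not part of the original question: in Python the bookkeeping above can be delegated to itertools.groupby, which performs the one-atom lookahead internally. The names process_chunks and chunk_key below are illustrative; chunk_key plays the role of START_OF_CHUNK by returning a value that changes at each chunk boundary.)

```python
from itertools import groupby

def process_chunks(atoms, chunk_key, process):
    """Split a stream of atoms into chunks and process each chunk.

    chunk_key(atom) must be constant within a chunk and change
    whenever a new chunk begins; groupby does the lookahead for us.
    """
    return [process(list(chunk)) for _, chunk in groupby(atoms, key=chunk_key)]

# Chunks here are maximal runs of numbers with the same sign:
sums = process_chunks([1, 2, -3, -4, 5], chunk_key=lambda x: x > 0, process=sum)
# sums == [3, -7, 5]
```

Note that this only fits chunkings that can be expressed as a key function over single atoms; the explicit loop is more general (e.g. when a chunk is introduced by a marker atom, as in the FASTA example below).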

So, my question is this:

Is there a name for this general class of problems and/or for the programming pattern shown above?

The remainder of this post is just a couple of (reasonably realistic) examples of what's described abstractly above. (The examples are in Python, although they could be translated easily to any imperative language.)

The first one is a function to produce a run-length encoding of an input string. In this case, the "atoms" are individual characters, and the "chunks" are maximal runs of the same character. Therefore, the program does not know that it has reached the end of a run until it reads the first character in the following run.

def rle(s):
    '''Compute the run-length encoding of s.'''

    n = len(s)
    ret = []
    accum = 0
    v = object()  # unique sentinel; ensures first test against x succeeds
    i = 0
    while True:
        x = s[i] if i < n else None
        i += 1
        if x is None or x != v:
            if accum > 0:
                ret.append((accum, v))
            if x is None:
                break
            accum = 0
        v = x
        accum += 1

    return ret
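(A cross-check, not part of the original question: the same encoding can be obtained with itertools.groupby, which hides the one-atom lookahead inside the library. rle_groupby is an illustrative name.)

```python
from itertools import groupby

def rle_groupby(s):
    '''Run-length encode s; should agree with rle above for any string.'''
    return [(sum(1 for _ in run), ch) for ch, run in groupby(s)]

# rle_groupby('aaabbc') == [(3, 'a'), (2, 'b'), (1, 'c')]
```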

The second example is a function that takes as argument a read handle to a FASTA-formatted file, and parses its contents. In this case, the atoms are lines of text. Each chunk consists of a specially-marked first line, called the "defline" (and distinguished by a '>' as its first character), followed by a variable number of lines containing stretches of nucleotide or protein sequence. Again, the code can detect the end of a chunk unambiguously only after reading the first atom (i.e. the defline) of the next chunk.

def read_fasta(fh):
    '''Read the contents of a FASTA-formatted file.'''

    ret = []
    accum = []
    while True:
        x = fh.readline()
        if x == '' or x.startswith('>'):
            if accum:
                ret.append((accum[0], ''.join(accum[1:])))
            if x == '':
                break
            accum = []
        accum.append(x.strip())

    return ret
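(Another aside: the same loop can be written as a generator so that records are yielded lazily instead of accumulated into a list; iter_fasta is an illustrative name, and for well-formed input it should produce the same pairs as read_fasta above.)

```python
import io

def iter_fasta(fh):
    '''Yield (defline, sequence) pairs from a FASTA-formatted file handle.'''
    accum = []
    while True:
        x = fh.readline()
        if x == '' or x.startswith('>'):
            if accum:
                yield (accum[0], ''.join(accum[1:]))
            if x == '':
                return
            accum = []
        accum.append(x.strip())

# Example on an in-memory file:
fh = io.StringIO('>seq1\nACGT\nACGT\n>seq2\nTTTT\n')
records = list(iter_fasta(fh))
# records == [('>seq1', 'ACGTACGT'), ('>seq2', 'TTTT')]
```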

Upvotes: 3

Views: 179

Answers (3)

templatetypedef

Reputation: 372784

I believe that what you're describing is something called a streaming algorithm, an algorithm where the input is specified one element at a time until some stop condition is triggered. Streaming algorithms can be used to model algorithms where data is received over a network or from some device that generates data. Often, streaming algorithms assume that there is some fixed bound on the amount of memory that can be stored at any point in time, meaning that the algorithm needs to take special care to preserve important information while discarding useless data.

Many interesting algorithms in computer science fail to work in the streaming case, and there is a large class of algorithms specially designed to work for streams. For example, there are good streaming algorithms for finding the top k elements of a stream (see this question, for example), for randomly choosing k elements out of a stream, for finding elements that appear with high frequency in a stream, etc. One of the other answers to this question (from @andrew cooke) mentions that this resembles LL parsing. Indeed, LL parsing is a streaming algorithm for doing parsing, as are many other parsing algorithms such as LR parsing, but they're special cases of the more general streaming algorithm framework.
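(For instance, "randomly choosing k elements out of a stream" is solved by reservoir sampling in a single pass with O(k) memory; below is a minimal sketch of the classic Algorithm R, added for illustration and not taken from the answer itself.)

```python
import random

def reservoir_sample(stream, k):
    '''Return k items chosen uniformly at random from an iterable,
    using one pass and O(k) memory (Algorithm R).'''
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)
        else:
            # Keep item i with probability k / (i + 1).
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = x
    return reservoir

sample = reservoir_sample(range(1000), 5)
# sample holds 5 of the 1000 values, each equally likely.
```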

Hope this helps!

Upvotes: 1

phs

Reputation: 11051

I implement this pattern regularly (esp. in conjunction with sorting) when aggregating simple statistics in indexing. I've never heard of a formal name but at our company internally we simply refer to it as "batching" or "grouping", after the SQL GROUP BY clause.

In our system batches are usually delimited by an extracted attribute (rather than a bald edge-driven predicate) which we call the batch or group key. By contrast your examples seem to check for explicit delimiters.
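(In code, that key-driven variant might look like the sketch below; batch_by_key is an illustrative name, and, as with SQL GROUP BY over an index, the input is assumed to be already sorted or clustered by the key.)

```python
def batch_by_key(records, key):
    '''Group consecutive records sharing key(record) into batches.
    Records must already be sorted/clustered by the key.'''
    batches = []
    accum = []
    prev = object()  # sentinel distinct from any real key
    for r in records:
        k = key(r)
        if accum and k != prev:
            batches.append((prev, accum))
            accum = []
        accum.append(r)
        prev = k
    if accum:
        batches.append((prev, accum))
    return batches

rows = [('a', 1), ('a', 2), ('b', 5)]
groups = batch_by_key(rows, key=lambda r: r[0])
# groups == [('a', [('a', 1), ('a', 2)]), ('b', [('b', 5)])]
```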

Upvotes: 1

andrew cooke

Reputation: 46882

the only thing i can think of is that it's a very simple LL(1) parser. you are parsing (in a very simple way) data from left to right and you need to look ahead one value to know what is happening. see http://en.wikipedia.org/wiki/LL_parser
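(That one-value lookahead can be made explicit with a small wrapper around an iterator; the Peekable class below is an illustrative sketch, not a standard-library API.)

```python
class Peekable:
    '''Wrap an iterator with one-token lookahead, as an LL(1) parser uses.'''
    _SENTINEL = object()

    def __init__(self, it):
        self._it = iter(it)
        self._buf = self._SENTINEL

    def peek(self, default=None):
        '''Return the next item without consuming it.'''
        if self._buf is self._SENTINEL:
            self._buf = next(self._it, self._SENTINEL)
        return default if self._buf is self._SENTINEL else self._buf

    def __iter__(self):
        return self

    def __next__(self):
        if self._buf is not self._SENTINEL:
            x, self._buf = self._buf, self._SENTINEL
            return x
        return next(self._it)

p = Peekable('ab')
# p.peek() == 'a'; next(p) == 'a'; next(p) == 'b'; p.peek() is None
```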

Upvotes: 2
