Bouke

Reputation: 104

Why does this Rascal pattern matching code use so much memory and time?

I'm trying to write what I would think of as an extremely simple piece of code in Rascal: testing whether list A contains list B.

I started out with some very basic code to create a list of strings:

public list[str] makeStringList(int Start, int End)
{
    return [ "some string with number <i>" | i <- [Start..End]];
}

public list[str] toTest = makeStringList(0, 200000); 

My first try was 'inspired' by the sorting example in the tutor:

public void findClone(list[str] In,  str S1, str S2, str S3, str S4, str S5, str S6)
{
    switch(In)
    {
        case [*str head, str i1, str i2, str i3, str i4, str i5, str i6, *str tail]:   
        {
            if(S1 == i1 && S2 == i2 && S3 == i3 && S4 == i4 && S5 == i5 && S6 == i6)
            {
                println("found duplicate\n\t<i1>\n\t<i2>\n\t<i3>\n\t<i4>\n\t<i5>\n\t<i6>");
            }
            fail;
         }   
         default:
            return;
    }
}

Not very pretty, but I expected it to work. Unfortunately, the code runs for about 30 seconds and then crashes with an "out of memory" error.

I then tried a better looking alternative:

public void findClone2(list[str] In, list[str] whatWeSearchFor)
{
    for ([*str head, *str mid, *str end] := In)
    if (mid == whatWeSearchFor)
        println("gotcha");
} 

with approximately the same result (it seems to run a little longer before running out of memory).

Finally, I tried a 'good old' C-style approach with a for-loop:

public void findClone3(list[str] In, list[str] whatWeSearchFor)
{
    cloneLength = size(whatWeSearchFor);
    inputLength = size(In);

    if(inputLength < cloneLength) return; // void function: nothing to return

    loopLength = inputLength - cloneLength + 1;

    for(int i <- [0..loopLength])
    {
        isAClone = true;
        for(int j <- [0..cloneLength])
        {
            if(In[i+j] != whatWeSearchFor[j])
                isAClone = false;
        }

        if(isAClone) println("Found clone <whatWeSearchFor> on lines <i> through <i+cloneLength-1>");   
    }
}

To my surprise, this one works like a charm: no out-of-memory error, and results in seconds.

I get that my first two attempts probably create a lot of temporary string objects that all have to be garbage collected, but I can't believe that the only solution that worked really is the best solution.

Any pointers would be greatly appreciated.

My relevant eclipse.ini settings are

-XX:MaxPermSize=512m
-Xms512m
-Xss64m
-Xmx1G

Upvotes: 3

Views: 402

Answers (2)

Jurgen Vinju

Reputation: 6696

It's an algorithmic issue, as Mark Hills said. In Rascal, some short code can still entail a lot of nested loops, almost implicitly. Basically, every * splice operator on a fresh variable that you use on the pattern side in a list generates one level of loop nesting, except for the last one, which is just the rest of the list.

In findClone2 you first generate all combinations of sublists and then filter them using the if construct. So that's a correct algorithm, but a slow one. This is your code:

void findClone2(list[str] In, list[str] whatWeSearchFor)
{
    for ([*str head, *str mid, *str end] := In)
    if (mid == whatWeSearchFor)
        println("gotcha");
}

You can see that it contains a nested loop over In, because the pattern has two effective * operators. The code therefore runs in O(n^2), where n is the length of In; that is, its running time is quadratic in the size of In. In is a big list, so this matters.
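To make the hidden nesting concrete, here is a minimal sketch that spells out the two loops as explicit ranges and counts the candidate (head, mid) splits the pattern has to try. The function countSplits is a hypothetical helper for illustration, not part of Rascal's library or of the original code:

```rascal
// Hypothetical helper: counts the candidate splits that the pattern
// [*head, *mid, *end] enumerates for a list of length n.
int countSplits(int n) {
    int count = 0;
    int limit = n + 1;               // ranges exclude their upper bound
    for (int i <- [0..limit])        // i = size(head)
        for (int j <- [i..limit])    // j = size(head) + size(mid)
            count += 1;
    return count;
}
```

The count works out to (n+1)(n+2)/2. For the 200000-element list in the question that is roughly 2 * 10^10 candidate splits, each followed by a list comparison, so calling the helper with that n would itself take a very long time.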

In the following new code, we filter first while generating answers, using fewer lines of code:

public void findCloneLinear(list[str] In, list[str] whatWeSearchFor)
{
    for ([*str head, *whatWeSearchFor, *str end] := In)
        println("gotcha");
} 

The second * operator does not generate a new loop because it is not fresh. It just "pastes" the given list values into the pattern. So now there is actually only one effective *, the first one on head, and it is the one that makes the algorithm loop over the list. The second * tests whether the elements of whatWeSearchFor are all right there in the list after head (this is linear in the size of whatWeSearchFor), and then the last * (*str end here, which could equally be written *_) just completes the list, allowing for more stuff to follow.

It's also nice to know where the clone is sometimes:

public void findCloneLinear(list[str] In, list[str] whatWeSearchFor)
{
    for ([*head, *whatWeSearchFor, *_] := In)
        println("gotcha at <size(head)>");
} 
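If you want the positions as a value instead of printed output, the same pattern can drive a list comprehension. This is a sketch under the assumption that a backtracking := match inside a comprehension yields one result per successful match; clonePositions is a hypothetical name, not a library function:

```rascal
import List;

// Hypothetical variant: collect the start index of every occurrence
// of whatWeSearchFor in In, instead of printing it.
list[int] clonePositions(list[str] In, list[str] whatWeSearchFor)
    = [size(head) | [*str head, *whatWeSearchFor, *_] := In];
```

There is still only one fresh * variable (head), so this adds no extra level of loop nesting compared to findCloneLinear.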

Rascal does not have an optimising compiler (yet) that might internally transform your algorithms into equivalent optimised ones. So as a Rascal programmer you are still expected to know the effect of loops on your algorithm's complexity, and to know that * is a very short notation for a loop.

Upvotes: 0

Mark Hills

Reputation: 1038

We'll need to look to see why this is happening. Note that, if you want to use pattern matching, this is maybe a better way to write it:

public void findClone(list[str] In,  str S1, str S2, str S3, str S4, str S5, str S6) {
    switch(In) {
        case [*str head, S1, S2, S3, S4, S5, S6, *str tail]: {
            println("found duplicate\n\t<S1>\n\t<S2>\n\t<S3>\n\t<S4>\n\t<S5>\n\t<S6>"); 
        } 
        default: 
            return; 
    } 
}

If you do this, you are taking advantage of Rascal's matcher to find the matching strings directly. In your first example, any string would match, and you then needed a number of separate comparisons to see whether the match represented the combination you were looking for. If I run this on 110145 through 110150, it takes a while but works, and it doesn't seem to grow beyond the heap space you allocated to it.

Also, is there a reason you are using fail? Is this to continue searching?

Upvotes: 1
