prologpattern-matchingsliding-windowdcggomoku

Reputation: 55

Sliding window method for pattern recognition in SWI-Prolog

I would like to find the most eloquent and efficient method, algorithmically speaking, to count occurrences of some patterns in SWI-Prolog.

For now, my solution uses DCG and looks like this:

count_occurrences(Pattern, List, Count) :-
    phrase(Pattern, PatternList),
    length(PatternList, PatternLength),
    length(List, ListLength),
    MaxIndex is ListLength - PatternLength,
    findall(1, (
        between(0, MaxIndex, Index),
        sublist(List, Index, PatternLength, PatternList)
    ), Counts),
    sum_list(Counts, Count).

sublist(List, Index, Length, Sublist) :-
    length(Sublist, Length),
    append(Prefix, Rest, List),
    length(Prefix, Index),
    append(Sublist, _, Rest).

rep(S, L) --> [], {L is 0} | [S], {L is 1} | [S], { L_1 is L - 1 }, rep(S, L_1), !.
open_rep(S, L) --> [v], rep(S, L), [v], !.

The result is this:

1 ?- count_occurrences(open_rep(n, 1), [v,n,v,n,v,v,v,v,v,v,n,v,v], Count).
Count = 3.

2 ?- count_occurrences(open_rep(n, 3), [v,n,v,n,v,v,v,v,v,v,n,n,n,v,v], Count).
Count = 1.

3 ?- count_occurrences(open_rep(n, 1), [v,n,v,n,v,v,v,v,v,v,n,n,n,v,v], Count).
Count = 2.

I am satisfied with this result, but I don't think this is the most efficient method to achieve it and would like some help making it faster to compute. I think that using append and findall is computationally expensive, but can't figure how to do the same without them...

The DCG shown here is just an example, but I have to count the occurrences of patterns in a list and give them a score. This is in the context of implementing an AI for Gomoku using Alpha-Beta Pruning. Since the board is evaluated frequently, algorithmic complexity matters in order to reduce the time it takes for the AI to take an action.

I have tried multiple variations of the code shown above, but they all use the findall predicate and the best solution I have found to reduce the compute time is to implement early fails.

Upvotes: 3

Answers (5)

Will Ness

Reputation: 71119

Applying the empirical orders of growth analysis, you can see that your code exhibits quadratic behavior while the various answers' code variants are linear.

Why is that?

It is because you insist on forcing the infix to start at a known index, increasing the index one by one. Then the infix string is found by the pair of appends, each time starting afresh from the input list's very beginning.

Thus the work is done in the classic triangular fashion (as is also illustrated in the linked entry), causing the problematic quadratic behavior.

But the two appends will do this on their own, moreover restarting at the last processed index and not from the index 0. Instead of starting from 0 and going along the input list to the given index i they will just continue from i directly; thus having the linear execution time.

To achieve this with your code, you just need to delete from it every bit dealing with the index manipulations, and it will magically turn into the efficient variant of what you see in the answers.

This gets you

count_occurrences(Pattern, List, Count) :-
    phrase(Pattern, PatternList),
    /*length(PatternList, PatternLength),
    length(List, ListLength),
    MaxIndex is ListLength - PatternLength,*/
    findall(1, (
        /*between(0, MaxIndex, Index),*/
        sublist(List, /*Index, PatternLength,*/ PatternList)
    ), Counts),
    sum_list(Counts, Count).

sublist(List, /*Index, Length,*/ Sublist) :-
    /*length(Sublist, Length),*/
    append(_Prefix, Rest, List),
    /*length(Prefix, Index),*/
    append(Sublist, _, Rest).

which checks out linear, as expected:

167 ?- random_list(1000,L),time(count_occurrences(open_rep(n,3),L,C)).
% 3,094 inferences, 0.000 CPU in 0.000 seconds (?% CPU, Infinite Lips)
L = [v, n, n, v, v, v, n, n, n|...],
C = 24.

168 ?- random_list(10000,L),time(count_occurrences(open_rep(n,3),L,C)).
% 30,514 inferences, 0.000 CPU in 0.003 seconds (0% CPU, Infinite Lips)
L = [n, n, v, v, v, v, v, n, v|...],
C = 280.

Tested using random_list/2 from another answer here.

Now you can fuse your calls to findall and sum_list into one predicate count as seen in that answer. Be sure to test both approaches though, with your actual data. Using library functions might still be faster. So, do also try the aggregate_all approach from TA_intern's answer.

Demand and control less, achieve more! Or,

Premature ~~optimization~~ implementation is the root of all evil.

Upvotes: 3

CapelliC

Reputation: 60034

IMO your approach is too much specific, and it will be sub optimal from the (re)usability viewpoint. SWI-Prolog offers a library predicate that performs RLE (run length encoding), as I discovered from this interesting topic, and is worth to try: here I'll post a module, where I copied your code, and an alternative that uses clumped/2:

/*  File:    x_so_sliding_window.pl
    Author:  Carlo
    Created: Mar  4 2023
    Purpose: https://stackoverflow.com/q/75630809/874024
*/

:- module(x_so_sliding_window,
          [count_occurrences/3
          ,open_rep//2
          ,count_by_clumped/3
          ]).


count_occurrences(Pattern, List, Count) :-
    phrase(Pattern, PatternList),
    length(PatternList, PatternLength),
    length(List, ListLength),
    MaxIndex is ListLength - PatternLength,
    findall(1, (
        between(0, MaxIndex, Index),
        sublist(List, Index, PatternLength, PatternList)
    ), Counts),
    sum_list(Counts, Count).

sublist(List, Index, Length, Sublist) :-
    length(Sublist, Length),
    append(Prefix, Rest, List),
    length(Prefix, Index),
    append(Sublist, _, Rest).

rep(S, L) --> [], {L is 0} | [S], {L is 1} | [S], { L_1 is L - 1 }, rep(S, L_1), !.
open_rep(S, L) --> [v], rep(S, L), [v], !.


count_by_clumped(Pattern,List,Count) :-
    clumped(List, R),
    aggregate_all(count, member(Pattern,R), Count).

then I had this code in x_so_sliding_window.plt

:- [x_so_sliding_window].

t(Count) :- count_occurrences(open_rep(n, 1), [v,n,v,n,v,v,v,v,v,v,n,v,v], Count).
t(Count) :- count_occurrences(open_rep(n, 3), [v,n,v,n,v,v,v,v,v,v,n,n,n,v,v], Count).
t(Count) :- count_occurrences(open_rep(n, 1), [v,n,v,n,v,v,v,v,v,v,n,n,n,v,v], Count).

c(Count) :- count_by_clumped(n-1, [v,n,v,n,v,v,v,v,v,v,n,v,v], Count).
c(Count) :- count_by_clumped(n-3, [v,n,v,n,v,v,v,v,v,v,n,n,n,v,v], Count).
c(Count) :- count_by_clumped(n-1, [v,n,v,n,v,v,v,v,v,v,n,n,n,v,v], Count).

that shows at first glance the equivalence between t/1 and c/1 calls:

?- t(Count).
Count = 3 ;
Count = 1 ;
Count = 2.

?- c(Count).
Count = 3 ;
Count = 1 ;
Count = 2.

and, at last, the 'benchmark' directly coded in the REPL:

?- time((between(1,1000,_),(t(_),fail))).
% 1,953,001 inferences, 0.234 CPU in 0.234 seconds (100% CPU, 8332804 Lips)
false.
?- time((between(1,1000,_),(c(_),fail))).
% 123,001 inferences, 0.000 CPU in 0.014 seconds (0% CPU, Infinite Lips)
false.

Upvotes: 1

TA_intern

Reputation: 2436

Something in your question is missing, can't tell what. I am also surprised by the answers.

Not sure if you must use lists or if you can use atoms or strings. Atoms and strings have sub_atom/5 and sub_string/5, resp, which can be used for searching. Like this:

?- sub_atom(vnvnvvvvvvnvv, Before, Len, After, vnv).
Before = 0, Len = 3, After = 10 ;
Before = 2, Len = 3, After = 8 ;
Before = 9, Len = 3, After = 1 ;
false.

The same but with strings works with sub_string/5.

And if you are already using SWI-Prolog, you can use library(aggregate) to count solutions:

?- aggregate_all(count, sub_atom(vnvnvvvvvvnvv, _,_,_, vnv), N).
N = 3.

?- aggregate_all(count, sub_atom(vnvnvvvvvvnnnvv, _,_,_, vnv), N).
N = 2.

So far you don't need to program anything. If you must use lists, the two appends are enough:

?- aggregate_all(
      count,
      ( append(_, Back, [v,n,v,n,v,v,v,v,v,v,n,n,n,v,v]),
        append([v,n,n,n,v], _, Back)
      ),
      N).
N = 1.

And remember that you also have the regular expression library library(pcre). If you use the example code from the docs:

:- use_module(library(pcre)).

re_match_count(Regex, String, N) :-
    re_foldl(inc, Regex, String, 0, N, []).

inc(_, V0, V) :- V is V0 + 1.

you can also count, using either atoms or strings:

?- re_match_count('vn{1}(?=v)', vnvnvvvvvvnvv, N).
N = 3.

?- re_match_count("vn{3}(?=v)", "vnvnvvvvvvnnnvv", N).
N = 1.

(Note: these regexes use "zero-width positive lookahead". They were initially borked by me, and fixed after Will Ness noticed the mistake).

In this example all information about the match itself is discarded and only the number of matches is counted. You could do anything you feel using the first argument to re_foldl/6.

Note: you need to measure what works fastest for you. If there is a way to read and keep the strings as Prolog strings, and use library(pcre), that might be faster than other methods.

Finally, if you want to use a DCG for the pattern recognition, you can use the following code to "match anywhere":

match_dcg(G) --> string(_), call(G), remainder(_).

This now completely ignores what comes before and after, or the position of the match within the list. You can use it like this, with your own open_rep//2:

?- aggregate_all(
       count,
       phrase(match_dcg(open_rep(n, 1)), [v,n,v,n,v,v,v,v,v,v,n,v,v]),
       N).
N = 3.

The difference between match_dcg//1 as defined here and the match//1 in the example in the docs to phrase_from_file/2 is that match_dcg//1 takes another DCG as an argument, not a pattern (substring).

Upvotes: 3

brebs

Reputation: 4456

Here's a better method for a sublist:

sublist1([H|T], Index, Length, Sublist) :-
    sublist1_start_(T, H, 1, Index, Length, Sublist).

% P = previous
sublist1_start_(L, P, Ind, Index, Length, Sublist) :-
    sublist1_loop_(L, P, Ind, Index, 1, Length, Sublist).
sublist1_start_([H|T], _, Ind, Index, Length, Sublist) :-
    Ind1 is Ind + 1,
    % Go to next element
    sublist1_start_(T, H, Ind1, Index, Length, Sublist).

sublist1_loop_(_, H, Ind, Ind, Len, Len, [H]).
sublist1_loop_([H|T], P, Ind, Index, Len, Length, [P|Sublist]) :-
    Len1 is Len + 1,
    sublist1_loop_(T, H, Ind, Index, Len1, Length, Sublist).

Results in swi-prolog:

?- sublist1([a,b,c], Ind, Len, Sub).
Ind = Len, Len = 1,
Sub = [a] ;
Ind = 1,
Len = 2,
Sub = [a, b] ;
Ind = 1,
Len = 3,
Sub = [a, b, c] ;
Ind = 2,
Len = 1,
Sub = [b] ;
Ind = Len, Len = 2,
Sub = [b, c] ;
Ind = 3,
Len = 1,
Sub = [c].

If you want the indexing to start from 0 rather than 1, then change the 1 to 0 in the 2nd line.

I've excluded empty lists.

Upvotes: 1

slago

Reputation: 5519

I think that, even using the predicate append/3, you can still have an efficient solution (at least that's what can be observed with swi-prolog).

count(Pattern, List, Count) :-
    phrase(Pattern, Sublist),
    count(List, Sublist, 0, Count).

count([], _, Count, Count).
count([X|Xs], Sublist, Count0, Count) :-
    (   append(Sublist, _, [X|Xs])
    ->   Count1 is Count0 + 1
    ;    Count1 is Count0 ),
    count(Xs, Sublist, Count1, Count).

% To compare the two versions with large lists

random_list(0, []) :- !.
random_list(N, [X|Xs]) :- 
    random_member(X, [n,v]), 
    M is N-1, 
    random_list(M, Xs).

Comparison of count/3 and count_occurrences/3, using swi-prolog:

?- random_list(1000,L), time(count(open_rep(n,1),L,C1)), time(count_occurrences(open_rep(n,1),L,C2)).
% 4,648 inferences, 0.000 CPU in 0.000 seconds (0% CPU, Infinite Lips)
% 3,003,226 inferences, 1.578 CPU in 1.683 seconds (94% CPU, 1903034 Lips)
L = [v, n, v, v, n, n, n, n, v|...],
C1 = C2, C2 = 116.

?- random_list(2000,L), time(count(open_rep(n,1),L,C1)), time(count_occurrences(open_rep(n,1),L,C2)).
% 9,288 inferences, 0.000 CPU in 0.000 seconds (0% CPU, Infinite Lips)
% 12,006,431 inferences, 11.812 CPU in 12.421 seconds (95% CPU, 1016417 Lips)
L = [n, n, v, v, n, n, n, v, n|...],
C1 = C2, C2 = 229.

?- random_list(1000000,L), time(count(open_rep(n,3),L,C1)).
% 4,905,979 inferences, 0.109 CPU in 0.146 seconds (75% CPU, 44854665 Lips)
L = [n, v, v, v, n, v, n, n, n|...],
C1 = 31246.

?- L = [v,n,v,n,v,v,v,v,v,v,n,v,v], time(count(open_rep(n,1),L,C1)), time(count_occurrences(open_rep(n,1),L,C2)).
% 77 inferences, 0.000 CPU in 0.000 seconds (0% CPU, Infinite Lips)
% 571 inferences, 0.000 CPU in 0.000 seconds (0% CPU, Infinite Lips)
L = [v, n, v, n, v, v, v, v, v|...],
C1 = C2, C2 = 3.

?- L = [v,n,v,n,v,v,v,v,v,v,n,n,n,v,v], time(count(open_rep(n,3),L,C1)), time(count_occurrences(open_rep(n,3),L,C2)).
% 99 inferences, 0.000 CPU in 0.000 seconds (0% CPU, Infinite Lips)
% 641 inferences, 0.000 CPU in 0.000 seconds (0% CPU, Infinite Lips)
L = [v, n, v, n, v, v, v, v, v|...],
C1 = C2, C2 = 1.

?- L = [v,n,v,n,v,v,v,v,v,v,n,n,n,v,v], time(count(open_rep(n,1),L,C1)), time(count_occurrences(open_rep(n,1),L,C2)).
% 86 inferences, 0.000 CPU in 0.000 seconds (0% CPU, Infinite Lips)
% 739 inferences, 0.000 CPU in 0.000 seconds (0% CPU, Infinite Lips)
L = [v, n, v, n, v, v, v, v, v|...],
C1 = C2, C2 = 2.

Upvotes: 2

Sliding window method for pattern recognition in SWI-Prolog

Answers (5)

Related Questions