Jonjilla
Jonjilla

Reputation: 463

Tcl partial string match of one list against another

I'm trying to find the items in list1 that are partial string matches against items from list2 using Tcl.

I'm using this, but it's very slow. Is there any more efficient way to do this?

set list1 [list abc bcd cde]
set list2 [list ab cd]

set l_matchlist [list]
foreach item1 $list1 {
     foreach item2 $list2 {
          if {[string match -nocase "*${item2}*" $item1]} {
               lappend l_matchlist $item1
               break
          }
     }
}

my actual lists are very long and this takes a long time. Is this the best way to do this?

Upvotes: 0

Views: 778

Answers (2)

steveh1491
steveh1491

Reputation: 21

You could cheat a bit and turn it from a list-processing task to a string processing task. The latter are usually quite a bit faster in Tcl.

Below I first turn list1 into a string with the original list elements separated by the ASCII field separator character "\x1F". Then the result can be gotten in a single loop via a regular expression search. The regular expression finds the first substring bounded by the field separator chars that contains item2:

# convert list to string:
set string1 \x1F[join $list1 \x1F]\x1F

set l_matchlist [list]
foreach item2 $list2 {
    # escape out regexp special chars:
    set item2 [regsub -all {\W} $item2 {\\&}]
    # use append to assemble regexp pattern
    set item2 [append x {[^\x1F]*} $item2 {[^\x1F]*}][unset x]
    if {[regexp -nocase $item2 $string1 match]} {
        lappend l_matchlist $match
    }
}

Upvotes: 2

Schelte Bron
Schelte Bron

Reputation: 4813

In addition to being slow, there is also a problem if list2 contains elements that have glob wildcard characters, such as '?' and '*'.

I expect the following method will work faster. At least it fixes the issue mentioned above:

set list1 [list abc BCD ace cde]
set list2 [list cd ab de]

set l_matchlist [list]
foreach item2 $list2 {
    lappend l_matchlist \
      {*}[lsearch -all -inline -nocase -regexp $list1 (?q)$item2]
}

The -regexp option in combination with (?q) may seem strange at first. It uses regexp matching and then tells regexp to treat the pattern as a literal string. But this has the effect of performing the partial match that you're after.

This differs with your version in that it may produce the results in a different order and the same item from list1 may be reported multiple times if it matches more than one item in list2.

If that is undesired, you can follow up with:

set l_matchlist [lmap item1 $list1 {
    if {$item1 ni $l_matchlist} continue
    set item1
}]

Of course that will reduce some of the speed gains achieved earlier.

Upvotes: 3

Related Questions