Tatt Ehian

Reputation: 79

Tcl regexp not returning all matches

I am reading a file, the content is as below:

 Aug2017:
 --------------------------------------
   Name   Age   Phone
 --------------------------------------
   Jack   25    128736372
   Peter  26    987840392
 --------------------------------------
 Sep2017:
 --------------------------------------
   Name   Age   Phone
 --------------------------------------
   Jared  21    874892032
   Eric   24    847938427
 --------------------------------------

I want to extract the text between each pair of dashed lines and put the pieces into a list. Assuming $data contains the file content, I am using the Tcl regexp below to extract the info:

regexp -all -inline -- {\s+\-{2,}\s+(.*?)\s+\-{2,}\s+} $data

As far as I know, the result is returned as a list containing a fullMatch and a subMatch for each match.

I double-checked with the llength command; the result holds only one fullMatch and one subMatch.

llength [regexp -all -inline -- {\s+\-{2,}\s+(.*?)\s+\-{2,}\s+} $data]
2

Why is there only 1 subMatch? There are supposed to be 5 matches, like below:

 Aug2017:
 --------------------------------------
   Name   Age   Phone       --> 1st Match
 --------------------------------------
   Jack   25    128736372
   Peter  26    987840392   --> 2nd Match
 --------------------------------------
 Sep2017:                   --> 3rd Match
 --------------------------------------
   Name   Age   Phone       --> 4th Match
 --------------------------------------
   Jared  21    874892032    
   Eric   24    847938427   --> 5th Match
 --------------------------------------

So in this case, I am choosing the second list element (subMatch) with lindex.

lindex [regexp -all -inline -- {\s+\-{2,}\s+(.*?)\s+\-{2,}\s+} $data] 1

However, the result I get looks like this; it seems to match from the first separator all the way to the last one:

  Name   Age   Phone
 --------------------------------------
  Jack   25    128736372
  Peter  26    987840392
 --------------------------------------
 Sep2017:
 --------------------------------------
  Name   Age   Phone
 --------------------------------------
  Jared  21    874892032
  Eric   24    847938427

My impression was that regexp should start at the beginning and match sequentially to the end of the string, so I am not sure why the Tcl regex behaves like this. Am I missing something?

** The main thing I want to achieve here is to extract data between the dashed separator, the above data is just an example.

Expected result: a list containing all matches

{ {Name   Age   Phone}      -->1st match 
  {Jack   25    128736372
   Peter  26    987840392}  -->2nd match
  {Sep2017:}                -->3rd match
  {Name   Age   Phone}      -->4th match
  {Jared  21    874892032
   Eric   24    847938427}  -->5th match
}

UPDATE: I have slightly changed my tcl regex as below, to include the lookahead and the suggestion by @glenn:

regexp -all -inline -expanded -- {\s+?-{2,}\s+?(.*?)(?=\s+?-{2,}\s+?)} $data

The result I got (10 submatches):

{ {----------------------
   Name   Age   Phone}      -->1st match
  {Name   Age   Phone}      -->2nd match
  {----------------------
   Jack   25    128736372
   Peter  26    987840392}  -->3rd match
  {Jack   25    128736372
   Peter  26    987840392}  -->4th match
  {----------------------
   Sep2017:}                -->5th match
  {Sep2017:}                -->6th match
    ...
    ...
}

It is pretty close to the expected result, but I still want to figure out how to use regex to perfectly match the expected 5 submatches.

Upvotes: 2

Views: 1151

Answers (1)

Peter Lewerin

Reputation: 13252

Regular expression matching is not a good tool for this kind of problem. You're much better off with some kind of line filter.
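As for why the pattern in the question returns a single over-long match: in Tcl's regular expressions, a pattern that mixes greedy and non-greedy quantifiers takes its overall preference from the first quantified atom. The leading \s+ is greedy, so the whole expression prefers the longest match, and (.*?) is stretched to fit; by default . also matches newlines. A minimal sketch of that effect, using a made-up sample string rather than the file from the question:

```tcl
# Hypothetical sample string, chosen only to show the preference rule.
set s " axbx"

# \s+ comes first and is greedy, so the whole RE prefers the longest match;
# the non-greedy (.*?) ends up capturing "axb".
set mixed [regexp -inline {\s+(.*?)x} $s]

# With \s+? first, the whole RE prefers the shortest match;
# (.*?) captures just "a".
set lazy [regexp -inline {\s+?(.*?)x} $s]

puts [list $mixed $lazy]
```

This is consistent with the UPDATE in the question, where switching to \s+? produced shorter matches.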

A regular expression-based filter, closely matched to your example lines:

set f [open data.txt]
while {[gets $f line] >= 0} {
    if {[regexp {:} $line]} continue
    if {![regexp {\d} $line]} continue
    puts $line
}
close $f

Rationale: only month name lines have colons, header lines and separators have no digits in them.

A filter that doesn't rely as much on regular expressions:

set f [open data.txt]
set skip 4
while {[gets $f line] >= 0} {
    if {$skip < 1} {
        if {[regexp {\-{2,}} $line]} {
            set skip 4
        } else {
            puts $line
        }
    } else {
        incr skip -1
    }
}
close $f

This code reads every line, skips four lines at the beginning of each month, and resets the skip to 4 when a line of dashes interrupts the data.

(Note: the expression \-{2,} makes it look like the dash is special in a regular expression and needs to be escaped for that reason. Actually, it's because if the dash is the first character in the expression, the regexp command tries to interpret it as a switch. regexp -- {-{2,}} ... would work too but looks even stranger, I think.)
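The switch-parsing behaviour described in the note can be checked with a short sketch; the sample line is an assumption, and catch is used so that the failing call does not abort the script:

```tcl
# Hypothetical separator line for illustration.
set line " ------"

# The pattern starts with a dash, so regexp tries to read it as a switch
# and raises an error; catch returns 1 in that case.
set failed [catch {regexp {-{2,}} $line} msg]

# Escaping the leading dash avoids the switch parsing.
set escaped [regexp {\-{2,}} $line]

# Ending switch parsing explicitly with -- works as well.
set explicit [regexp -- {-{2,}} $line]

puts [list $failed $escaped $explicit]
```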

ETA (see comment): to get data between separators (i.e. just filter out the separators), try this:

set f [open data.txt]
while {[gets $f line] >= 0} {
    if {![regexp {\-{2,}} $line]} {
        puts $line
    }
}
close $f

Or:

package require fileutil

::fileutil::foreachLine line data.txt {
    if {![regexp {\-{2,}} $line]} {
        puts $line
    }
}

This should also work:

regsub -all -line {^\s+-{2,}.*(\n|\Z)} $data {}

With newline-sensitive matching enabled (-line), this matches and removes every line that starts with whitespace followed by two or more dashes, including the rest of that line and its terminating newline (or the end of the whole string).
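A small sketch of the regsub approach on an in-memory string. The sample data is an assumption modelled on the question's file; note that each dashed line is indented by one space, which the \s+ at the start of the pattern requires:

```tcl
# Hypothetical sample data, shortened from the layout shown in the question.
set data " Aug2017:\n ------\n   Name   Age   Phone\n ------\n"

# -line makes ^ match at the start of every line and stops . at newlines,
# so each dashed line is removed together with its trailing newline.
set cleaned [regsub -all -line {^\s+-{2,}.*(\n|\Z)} $data {}]

puts $cleaned
```

If the separators in the real file are not indented, \s* in place of \s+ would be the variant to try.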

To collect a list of matches rather than just printing filtered lines:

set matches {}
set matchtext {}
::fileutil::foreachLine line data.txt {
    if {![regexp {\-{2,}} $line]} {
        append matchtext $line\n
    } else {
        lappend matches $matchtext
        set matchtext {}
    }
}

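After running this, the variable matches contains a list whose items are the runs of contiguous lines that precede each separator (every line keeps a trailing newline). Any text after the last separator is not added to the list.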
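If the content is already in a variable, as with $data in the question, the same idea can be sketched without fileutil. The proc name blocksBetweenSeparators is hypothetical, and trimming each line and skipping empty blocks are choices made here, not part of the code above:

```tcl
# Split a string into blocks of lines delimited by dashed separator lines.
proc blocksBetweenSeparators {data} {
    set blocks {}
    set current {}
    foreach line [split $data \n] {
        if {[regexp -- {-{2,}} $line]} {
            # Separator: close the current block if it holds any lines.
            if {[llength $current] > 0} {
                lappend blocks [join $current \n]
            }
            set current {}
        } elseif {[string trim $line] ne ""} {
            lappend current [string trim $line]
        }
    }
    # Keep a trailing block that is not followed by a separator.
    if {[llength $current] > 0} {
        lappend blocks [join $current \n]
    }
    return $blocks
}

# Hypothetical sample data modelled on the question's file.
set data " Aug2017:\n ----\n   Name   Age   Phone\n ----\n   Jack   25    128736372\n   Peter  26    987840392\n ----\n"
set blocks [blocksBetweenSeparators $data]
puts [llength $blocks]
```

The first block is the text before the first separator (here Aug2017:), which the expected result in the question leaves out; lrange $blocks 1 end drops it.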

Another way to do the same thing:

package require textutil

::textutil::splitx $data {(?n)^\s+-{2,}.*(?:\n|\Z)}

(It also adds an empty element at the end of the list, which is easy enough to remove if it is a problem.)

Documentation: < (operator), >= (operator), append, close, continue, fileutil (package), gets, if, incr, lappend, open, package, puts, regexp, set, textutil (package), while, Syntax of Tcl regular expressions

Upvotes: 2
