Reputation: 79
I am reading a file whose content is as below:
Aug2017:
--------------------------------------
Name Age Phone
--------------------------------------
Jack 25 128736372
Peter 26 987840392
--------------------------------------
Sep2017:
--------------------------------------
Name Age Phone
--------------------------------------
Jared 21 874892032
Eric 24 847938427
--------------------------------------
I want to extract the text between each pair of dashed lines and put the pieces into a list. Assuming $data contains the file content, I am using the Tcl regexp below to extract the info:
regexp -all -inline -- {\s+\-{2,}\s+(.*?)\s+\-{2,}\s+} $data
As far as I know, the matched result is returned as a list containing fullMatch and subMatch.
I double-checked with the llength command: there is only one fullMatch and one subMatch.
llength [regexp -all -inline -- {\s+\-{2,}\s+(.*?)\s+\-{2,}\s+} $data]
2
Why is there only one subMatch? There should be 5 matches, as below:
Aug2017:
--------------------------------------
Name Age Phone --> 1st Match
--------------------------------------
Jack 25 128736372
Peter 26 987840392 --> 2nd Match
--------------------------------------
Sep2017: --> 3rd Match
--------------------------------------
Name Age Phone --> 4th Match
--------------------------------------
Jared 21 874892032
Eric 24 847938427 --> 5th Match
--------------------------------------
So in this case, I am choosing the second list element (subMatch) with lindex.
lindex [regexp -all -inline -- {\s+\-{2,}\s+(.*?)\s+\-{2,}\s+} $data] 1
However, the result I get looks like this; it seems to match from the first separator all the way to the last one:
Name Age Phone
--------------------------------------
Jack 25 128736372
Peter 26 987840392
--------------------------------------
Sep2017:
--------------------------------------
Name Age Phone
--------------------------------------
Jared 21 874892032
Eric 24 847938427
My impression was that regexp should start at the beginning and match sequentially to the end of the string, so I am not sure why the Tcl regex behaves like this. Am I missing something?
** The main thing I want to achieve here is to extract data between the dashed separator, the above data is just an example.
Expected result: a list containing all matches
{ {Name Age Phone} -->1st match
{Jack 25 128736372
Peter 26 987840392} -->2nd match
{Sep2017:} -->3rd match
{Name Age Phone} -->4th match
{Jared 21 874892032
Eric 24 847938427} -->5th match
}
UPDATE: I have slightly changed my tcl regex as below, to include the lookahead and the suggestion by @glenn:
regexp -all -inline -expanded -- {\s+?-{2,}\s+?(.*?)(?=\s+?-{2,}\s+?)} $data
The result I got (10 submatches):
{ {----------------------
Name Age Phone} -->1st match
{Name Age Phone} -->2nd match
{----------------------
Jack 25 128736372
Peter 26 987840392} -->3rd match
{Jack 25 128736372
Peter 26 987840392} -->4th match
{----------------------
Sep2017:} -->5th match
{Sep2017:} -->6th match
...
...
}
It is pretty close to the expected result, but I still want to figure out how to use regex to perfectly match the expected 5 submatches.
Upvotes: 2
Views: 1151
Reputation: 13252
Regular expression matching is not a good tool for this kind of problem. You're much better off with some kind of line filter.
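As background on why the original pattern returned a single match: in Tcl's regular expression engine, a whole branch takes its greedy or non-greedy preference from the first quantified atom that has one. The pattern in the question starts with a greedy \s+, so the pattern as a whole prefers the longest match, even though it contains .*? further in. A minimal sketch, using a shortened stand-in for the file content (the sample string is an assumption for illustration):

```tcl
# Shortened stand-in for the file content (an assumption for illustration).
set sample "A\n---\nB\n---\nC\n---\n"

# The pattern from the question: the leading \s+ is greedy, so the whole
# expression prefers the longest match and spans first to last separator.
set result [regexp -all -inline -- {\s+\-{2,}\s+(.*?)\s+\-{2,}\s+} $sample]

# One full match plus one submatch.
puts [llength $result]   ;# 2
```

This mixing of greedy and non-greedy quantifiers is one more reason a line-by-line filter is easier to reason about here.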
A regular expression-based filter, closely matched to your example lines:
set f [open data.txt]
while {[gets $f line] >= 0} {
    if {[regexp {:} $line]} continue
    if {![regexp {\d} $line]} continue
    puts $line
}
close $f
Rationale: only month-name lines have colons, and header lines and separators have no digits in them.
A filter that doesn't rely as much on regular expressions:
set f [open data.txt]
set skip 4
while {[gets $f line] >= 0} {
    if {$skip < 1} {
        if {[regexp {\-{2,}} $line]} {
            set skip 4
        } else {
            puts $line
        }
    } else {
        incr skip -1
    }
}
close $f
This code reads every line, skips four lines at the beginning of each month, and resets the skip to 4 when a line of dashes interrupts the data.
(Note: the expression \-{2,} makes it look like the dash is special in a regular expression and needs to be escaped for that reason. Actually, it's because if the dash is the first character in the expression, the regexp command tries to interpret it as a switch. regexp -- {-{2,}} ... would work too but looks even stranger, I think.)
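The point about the leading dash can be checked with a short sketch. The sample line is an assumption; the three calls compare the escaped form, the -- form, and the unprotected form:

```tcl
set line "--------------------------------------"

# Escaped dash: the pattern does not start with "-", so it is not parsed as a switch.
set escaped [regexp {\-{2,}} $line]

# Explicit end-of-switches marker: the unescaped pattern is accepted.
set marked [regexp -- {-{2,}} $line]

# Neither protection: regexp tries to read the pattern as a switch and raises an error.
set failed [catch {regexp {-{2,}} $line} msg]

puts "$escaped $marked $failed"   ;# 1 1 1
```

Here catch returns 1 for the third call because the command fails with a "bad option" error rather than performing a match.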
ETA (see comment): to get data between separators (i.e. just filter out the separators), try this:
set f [open data.txt]
while {[gets $f line] >= 0} {
    if {![regexp {\-{2,}} $line]} {
        puts $line
    }
}
close $f
Or:
package require fileutil
::fileutil::foreachLine line data.txt {
    if {![regexp {\-{2,}} $line]} {
        puts $line
    }
}
This should also work:
regsub -all -line {^\s+-{2,}.*(\n|\Z)} $data {}
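With newline-sensitive matching enabled (-line), this matches and removes every line that begins with whitespace followed by two or more dashes, together with any remaining non-newline characters and the terminating newline (or the end of the whole string). Note that \s+ requires at least one whitespace character before the dashes; if the separators start in the first column, use \s* instead.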
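A minimal sketch of the regsub approach on an in-memory string. It assumes the separators start in the first column, so it uses \s* where the command above uses \s+; the sample data is shortened for illustration:

```tcl
# Shortened sample data with separators at column 0 (an assumption).
set data "Aug2017:\n------\nName Age Phone\n------\nJack 25 128736372\nPeter 26 987840392\n------\n"

# \s* (instead of \s+) lets the pattern match separators with no leading whitespace.
# -line makes ^ match at the start of every line and stops . from matching newlines.
set filtered [regsub -all -line {^\s*-{2,}.*(\n|\Z)} $data {}]

puts $filtered
```

The result still holds the remaining lines as one string; split $filtered \n turns it into a list of lines if needed.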
To collect a list of matches rather than just printing filtered lines:
set matches {}
set matchtext {}
::fileutil::foreachLine line data.txt {
    if {![regexp {\-{2,}} $line]} {
        append matchtext $line\n
    } else {
        lappend matches $matchtext
        set matchtext {}
    }
}
After running this, the variable matches contains a list whose items are contiguous lines between separators.
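The same collection step can be sketched without the fileutil package, working on a string that is already in memory. The proc name groupsBetweenSeparators is hypothetical, and the sketch assumes that blank lines should be ignored and that text after the last separator should be kept:

```tcl
# groupsBetweenSeparators is a hypothetical helper name, not part of any package.
proc groupsBetweenSeparators {text} {
    set groups {}
    set current {}
    foreach line [split $text \n] {
        # Ignore blank lines, including the empty piece after a trailing newline.
        if {[string trim $line] eq ""} {
            continue
        }
        if {[regexp {^\s*-{2,}\s*$} $line]} {
            # Separator line: close the current group if it holds any lines.
            if {[llength $current] > 0} {
                lappend groups [join $current \n]
            }
            set current {}
        } else {
            lappend current $line
        }
    }
    # Keep any text that follows the last separator.
    if {[llength $current] > 0} {
        lappend groups [join $current \n]
    }
    return $groups
}

set data "Aug2017:\n-----\nName Age Phone\n-----\nJack 25 128736372\nPeter 26 987840392\n-----\nSep2017:\n-----\nName Age Phone\n-----\nJared 21 874892032\nEric 24 847938427\n-----\n"
set groups [groupsBetweenSeparators $data]
puts [llength $groups]   ;# 6
```

The first group is the text before the first separator (Aug2017:). Dropping it with lrange $groups 1 end leaves the five groups listed as the expected result in the question.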
Another way to do the same thing:
package require textutil
::textutil::splitx $data {(?n)^\s+-{2,}.*(?:\n|\Z)}
(It also adds an empty element at the end of the list, which is easy enough to remove if it is a problem.)
Documentation: < (operator), >= (operator), append, close, continue, fileutil (package), gets, if, incr, lappend, open, package, puts, regexp, set, textutil (package), while, Syntax of Tcl regular expressions
Upvotes: 2