Duck_dragon

Reputation: 450

Using regex to find start and end of paragraphs in text file

I have this text file which looks like below:

Current File: week-28\gcweb.txt (=>) ########## Old File: week-27\gcweb.txt (<=)



2019-07-21 13:20:42 ip-172-17-3-71=>
2019-07-17 13:27:12 ip-172-17-3-71<=
--------------------------------------------------
--------------------------------------------------
Current File: week-28\gcckup.txt (=>) ########## Old File: week-27\gcckup.txt (<=)



2019-07-21 13:20:46 ip-172-17-2-101=>
2019-07-17 13:27:14 ip-172-17-2-101<=
--------------------------------------------------
--------------------------------------------------

The text from Current File down to the ------ lines forms one paragraph (one part). I need to extract each of these parts separately and then apply some other operations to them. I tried using a regex to get the entire text starting from Current File.

The regex I used is:

\bCurrent File\b.+ 

My question is: how can I select the whole text of one paragraph? I have little experience with regex, and I am hoping to get something like this:

Current File: week-28\gcweb.txt      Old File: week-27\gcweb.txt
2019-07-21 13:20:42 ip-172-17-3-71   2019-07-17 13:27:12 ip-172-17-3-71

Here (=>) and (<=) are simply indicators for current and old. To get the file path I tried \bCurrent File\b.+(=>), but this gives (=>) as a group.

I need help extracting these strings so that I can apply the rest of the operations to them afterwards.

Upvotes: 1

Views: 336

Answers (2)

The fourth bird

Reputation: 163457

Another option to get the filenames in a group followed by the match could be:

Current File: (\S+\.txt)[^O]*(?:O(?!ld File)|[^O])+ Old File: (\S+\.txt).*(?:\r?\n(?!--).*)*(?=\r?\n--)
  • Current File: (\S+\.txt) Match Current File: and capture the filename in group 1.
  • [^O]* Match 0+ times any char except O
  • (?: Non capturing group
    • O(?!ld File) Match O, assert what is directly on the right is not ld File
    • | Or
    • [^O] Match any char except O
  • )+ Close non capturing group and repeat 1+ times
  •  Old File: (\S+\.txt) Match a space, then Old File: and capture the filename in group 2
  • .* Match any char except newline 0+ times
  • (?: Non capturing group
    • \r?\n(?!--) Match a newline and assert what is on the right is not --
    • .* Match any char except a newline 0+ times
  • )* Close non capturing group and repeat 0+ times
  • (?=\r?\n--) Positive lookahead, assert what is on the right is a newline and --

Regex demo

const regex = /Current File:[ \t]*(\S+\.txt)[^O]*(?:O(?!ld File)|[^O])+ Old File:[ \t]*(\S+\.txt).*(?:\r?\n(?!--).*)*(?=\r?\n--)/gm;
const str = `Current File: week-28\\gcweb.txt (=>) ########## Old File: week-27\\gcweb.txt (<=)



2019-07-21 13:20:42 ip-172-17-3-71=>
2019-07-17 13:27:12 ip-172-17-3-71<=
--------------------------------------------------
--------------------------------------------------
Current File: week-28\\gcckup.txt (=>) ########## Old File: week-27\\gcckup.txt (<=)



2019-07-21 13:20:46 ip-172-17-2-101=>
2019-07-17 13:27:14 ip-172-17-2-101<=
--------------------------------------------------
--------------------------------------------------`;
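let m;

while ((m = regex.exec(str)) !== null) {
    // Guard against infinite loops on zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }

    // Group 0 is the whole block, groups 1 and 2 are the two file names
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}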

Upvotes: 1

Emma

Reputation: 27733

One option is an expression that looks like this:

Current File:[\s\S]*?(?=--)

The expression is explained in the top-right panel of regex101.com if you wish to explore, simplify, or modify it, and the demo there shows how it matches against some sample inputs.
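
As a rough illustration of how that first expression could be used to pull each part out of the text, here is a minimal sketch. The names sample, paragraphPattern, and paragraphs are made up for this example, and the sample text is the question's input with the repeated blank lines shortened. It assumes that no "--" appears inside a part before the separator line.

```javascript
// Sample input based on the question (backslashes are doubled inside the template literal).
const sample = `Current File: week-28\\gcweb.txt (=>) ########## Old File: week-27\\gcweb.txt (<=)

2019-07-21 13:20:42 ip-172-17-3-71=>
2019-07-17 13:27:12 ip-172-17-3-71<=
--------------------------------------------------
--------------------------------------------------
Current File: week-28\\gcckup.txt (=>) ########## Old File: week-27\\gcckup.txt (<=)

2019-07-21 13:20:46 ip-172-17-2-101=>
2019-07-17 13:27:14 ip-172-17-2-101<=
--------------------------------------------------
--------------------------------------------------`;

// The lazy [\s\S]*? stops at the first position that is followed by "--".
const paragraphPattern = /Current File:[\s\S]*?(?=--)/g;

// matchAll needs the g flag; m[0] is the full matched text of each part.
const paragraphs = Array.from(sample.matchAll(paragraphPattern), (m) => m[0].trim());

paragraphs.forEach((p, i) => console.log(`Paragraph ${i + 1}:\n${p}\n`));
```

Each entry of paragraphs then holds one part, from Current File up to (but not including) the dashed separator, so further operations can be applied to the parts one at a time.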


Edit:

To get the .txt paths, we can likely use an expression similar to:

Current File:\s*(\S+\.txt).*Old File:\s*(\S+\.txt)[\s\S]*?(?=-{4,})

Demo 2

const regex = /Current File:\s*(\S+\.txt).*Old File:\s*(\S+\.txt)[\s\S]*?(?=-{4,})/gm;
const str = `Current File: week-28\\gcweb.txt (=>) ########## Old File: week-27\\gcweb.txt (<=)



2019-07-21 13:20:42 ip-172-17-3-71=>
2019-07-17 13:27:12 ip-172-17-3-71<=
--------------------------------------------------
--------------------------------------------------
Current File: week-28\\gcckup.txt (=>) ########## Old File: week-27\\gcckup.txt (<=)



2019-07-21 13:20:46 ip-172-17-2-101=>
2019-07-17 13:27:14 ip-172-17-2-101<=
--------------------------------------------------
--------------------------------------------------`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
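
To get from a matched block to the two-line output shown in the question, something along these lines might work. It is only a sketch: summarizeBlock is a hypothetical helper, and it assumes every block has exactly one timestamp line ending in => and one ending in <=, each starting with a date in YYYY-MM-DD form.

```javascript
// Hypothetical helper: builds the two-line summary from the question for one block.
function summarizeBlock(block) {
    // Group 1 is the current file path, group 2 is the old file path.
    const files = /Current File:\s*(\S+\.txt).*Old File:\s*(\S+\.txt)/.exec(block);
    // Timestamp lines start with a date and end with "=>" (current) or "<=" (old).
    const currentLine = /^(\d{4}-\d{2}-\d{2} .+?)=>[ \t]*$/m.exec(block);
    const oldLine = /^(\d{4}-\d{2}-\d{2} .+?)<=[ \t]*$/m.exec(block);
    if (!files || !currentLine || !oldLine) {
        return null; // the block does not have the assumed shape
    }
    return [
        `Current File: ${files[1]}   Old File: ${files[2]}`,
        `${currentLine[1]}   ${oldLine[1]}`,
    ].join("\n");
}

// One block as it would come out of group 0 of the expression above.
const block = `Current File: week-28\\gcweb.txt (=>) ########## Old File: week-27\\gcweb.txt (<=)

2019-07-21 13:20:42 ip-172-17-3-71=>
2019-07-17 13:27:12 ip-172-17-3-71<=`;

console.log(summarizeBlock(block));
```

The header line is not mistaken for a timestamp line because its (=>) and (<=) markers are followed by a closing parenthesis and the line does not start with a date.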

Upvotes: 1
