Naidu

Reputation: 149

Extract data between square brackets "[]" using Perl

I was using a regex to extract data from round brackets (parentheses), for example extracting a,b from (a,b), as shown below. Every line in my file looks like this:

this is the range of values (a1,b1) and [b1|a1]
this is the range of values (a2,b2) and [b2|a2]
this is the range of values (a3,b3) and [b3|a3]

I'm using the following regex to extract a1,b1, a2,b2, etc.:

@numbers = $_ =~ /\((.*),(.*)\)/

However, if I want to extract the data from square brackets [], how can I do it? For example

this is the range of values (a1,b1) and [b1|a1]
this is the range of values (a1,b1) and [b2|a2]

I need to extract/match only the data in the square brackets, not the data in the parentheses.

Upvotes: 4

Views: 9433

Answers (5)

mihirjoshi

Reputation: 12201

I know I'm a little late here, but none of the answers correctly answers the OP's question, and the one that comes closest matches the entire thing, square brackets [] included. Clearly the OP wants to match only what is inside the brackets.

  • To match everything inside square brackets, including the brackets themselves:

    \[[^\[\]]*]

  • To match everything inside square brackets, excluding the brackets themselves, use a positive lookbehind and a positive lookahead:

    (?<=\[)[^\[\]]*(?=\])
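As a rough sketch of how the two patterns above behave in Perl, the snippet below applies both to one sample line from the question. The variable names ($line, $with_brackets, $inside) are made up for illustration, and an outer capture group is added around each pattern so that a list-context match returns the matched text.

```perl
use strict;
use warnings;

# Sample line taken from the question.
my $line = 'this is the range of values (a1,b1) and [b1|a1]';

# Pattern 1: the match includes the square brackets.
my ($with_brackets) = $line =~ /(\[[^\[\]]*\])/;

# Pattern 2: lookbehind and lookahead assert the brackets
# without making them part of the match.
my ($inside) = $line =~ /((?<=\[)[^\[\]]*(?=\]))/;

print "$with_brackets\n";   # [b1|a1]
print "$inside\n";          # b1|a1
```

Note that the lookaround version still returns b1|a1 as a single string; splitting it into its two parts needs capture groups or a split on the pipe.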

Upvotes: 0

shreyaskar

Reputation: 425

Use the code below:

$_ =~ /\[(.*?)\|(.*?)\]/g;

If the pattern matches, the extracted values are stored in $1 and $2.
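Because $1 and $2 keep their old values after a failed match, it is safer to read them only inside a check on the match result. A minimal sketch, assuming two made-up input lines and a hypothetical @pairs array for collecting results:

```perl
use strict;
use warnings;

my @pairs;
for my $line ('values (a1,b1) and [b1|a1]', 'no brackets here') {
    # Read $1 and $2 only when the match succeeded.
    if ($line =~ /\[(.*?)\|(.*?)\]/) {
        push @pairs, "$1,$2";
    }
}
print "$_\n" for @pairs;   # prints b1,a1
```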

Upvotes: 0

Marius Schulz

Reputation: 16440

[Update] In the meantime, I've written a blog post about the specific issue with .* that I describe below: "Why Using .* in Regular Expressions Is Almost Never What You Actually Want".


If your identifiers a1, b1 etc. never contain commas or square brackets themselves, you should use a pattern along the lines of the following to avoid backtracking hell:

/\[([^,\]]+),([^,\]]+)\]/

Here's a working example on Regex101.

The issue with greedy quantifiers like .* is that they will very likely consume too much at first, so the regex engine has to backtrack extensively. Even with non-greedy quantifiers, the engine makes more match attempts than necessary, because it consumes only one character at a time and then tries to advance in the pattern.

(You could even use atomic groups to make the matching even more performant.)
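The sample lines in the question separate the two values with | rather than a comma. The following is only a sketch of how the same negated-character-class idea might be adapted to that separator, assuming the identifiers never contain | or ]. It uses possessive quantifiers (++, available since Perl 5.10), which behave like an atomic group around the quantified class. The variable names are illustrative.

```perl
use strict;
use warnings;

my $line = 'this is the range of values (a1,b1) and [b1|a1]';

# [^|\]]++ consumes every character that is neither | nor ],
# and never gives any of them back through backtracking.
if ($line =~ /\[([^|\]]++)\|([^|\]]++)\]/) {
    print "$1 $2\n";   # b1 a1
}

my ($first, $second) = $line =~ /\[([^|\]]++)\|([^|\]]++)\]/;
```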

Upvotes: 28

mpapec

Reputation: 50637

You can match it using the non-greedy quantifier *?:

my @numbers = $_ =~ /\[(.*?),(.*?)\]/g;

or

my @numbers = /\[(.*?),(.*?)\]/g;

for short.

UPDATE

my @numbers = /\[(.*?)\|(.*?)\]/g;
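In list context, a match with /g returns the captures of every match in the string, flattened into one list. A small sketch with a made-up input that contains two bracketed pairs:

```perl
use strict;
use warnings;

# Hypothetical input with two bracketed pairs on one line.
local $_ = 'first [b1|a1] then [b2|a2]';

# List context plus /g collects both captures from every match.
my @numbers = /\[(.*?)\|(.*?)\]/g;

print "@numbers\n";   # b1 a1 b2 a2
```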

Upvotes: 1

Chankey Pathak

Reputation: 21666

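#!/usr/bin/perl
use strict;
use warnings;

my @numbers;
# Loop on the result of the read, not on chomp's return value,
# so a final line without a trailing newline is not skipped.
while (my $line = <DATA>) {
    chomp $line;
    if ($line =~ m|\[(.*),(.*)\]|) {
        push @numbers, $1, $2;
    }
}
print "@numbers\n";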
__DATA__
this is the range of values [a1,b1]
this is the range of values [a2,b2]
this is the range of values [a3,b3]


Upvotes: 2
