Reputation: 1285

Perl Regex - capture all characters until a pattern

I'm trying to extract four chunks of information from a string. The string is a file name, extension included. The first group can contain any valid characters up to the space before the second group. The second group is 4 digits inside a pair of square brackets, separated from the first group by a space. The third group is 3 or 4 digits followed by the letter "p", enclosed in parentheses and separated from the previous group by a space. The last group is simply the file extension.

Here's an example:

This, could be ['a'] s(@m)pl3 file name_with any characters [1923] (720p).avi

That would then need to be parsed to be:

$1 = This, could be ['a'] s(@m)pl3 file name_with any characters
$2 = 1923
$3 = 720p
$4 = avi

Upvotes: 0

Answers (5)

David W.

Reputation: 107040

Whether Perl or not, sometimes the problem with a regular expression is its greediness. Let's say I want to capture the first name of someone and the string looked like this:

Bob Baker

I could use this regular expression:

sed 's/^\(.*\) .*$/\1/'

That would work with Bob Baker, but not with Bob Barry Baker. The problem is that my regular expression is greedy and will select all of the characters up to the last space, so I would end up not with Bob but with Bob Barry. A common way to solve this is to match every character except the one you don't want:

sed 's/^\([^ ]*\) .*$/\1/'

In this case, I am specifying any set of characters not including a space. This will change both Bob Baker and Bob Rudolph Baker to just Bob.
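The same behaviour is easy to check from Perl. This is only a minimal sketch of the two sed expressions above, rewritten as Perl matches; the variable names are mine, chosen for illustration.

```perl
use strict;
use warnings;

my $name = 'Bob Barry Baker';

# Greedy: .* grabs as much as it can, so the capture runs up to the LAST space.
my ($greedy) = $name =~ /^(.*) .*$/;

# Negated class: [^ ]* cannot cross a space, so the capture stops at the FIRST one.
my ($first) = $name =~ /^([^ ]*) .*$/;

print "greedy:        $greedy\n";   # Bob Barry
print "negated class: $first\n";    # Bob
```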

Perl has another way of specifying a non-greedy regular expression. In Perl, you add a ? after the quantifier of the sub-expression you want to be non-greedy. In the above example, both of these will change a string containing Bob Barry Baker to just Bob:

$string =~ s/^([^ ]+) .*$/$1/;
$string =~ s/^(.+?) .*$/$1/;

By the way, these are not equivalent!

With the "everything but a space" regex, I could do this:

$string =~ /^([^ ]+)( )(\[\d{4}\])( )(\(\d+p\))(\.)([^.]+)/;

With the non-greedy quantifier:

$string =~ /^(.+?)( )(\[\d{4}\])( )(\(\d+p\))(\.)(.*)/;

And with the x modifier, which lets you spread the same regular expression over multiple lines and add comments explaining what each part does:

$string =~ /
     ^(.+?)                   #Any set of characters (non-greedy)
     ([ ])                    #Space
     (\[\d{4}\])              #[1959]
     ([ ])                    #Space
     (\([0-9]+p\))            #(430p)
     [.]                      #Period
     ([^\.]+)                 #File Suffix (no period)
/x

And, at this point, you might as well follow Damian Conway's Best Practice recommendations on Perl regular expressions.

$string =~ /
     \A                 #Start-of-string anchor
     ( .+? )            #Any set of characters (non-greedy)
     ( [ ] )            #Space
     ( \[ \d{4} \] )    #[1959]
     ( [ ] )            #Space
     ( \( [0-9] +p \) ) #(430p)
     ( [.] )            #Period
     ( [^\.]+ )         #File Suffix (no period)
     \Z                 #End of string anchor
/xm;

Since x ignores most white space in the pattern, I can even add spaces inside the subgroups on the same line. In this case, ( .+? ) is just a bit cleaner than (.+?). Whether ( \( [0-9] +p \) ) or ( \( [0-9]+p \) ) or even ( \([0-9]+p\) ) is easier to understand is up to you.

And, yes the answer looks very much like Sinan's answer.

By the way, as Sinan showed, the non-greedy version can parse a b c d e [1234] (1080p).mov, while the "everything that isn't a space" sub-expression can't, because the first group would have to contain spaces. That's why I said they're not the same.
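To see the difference, here is a small sketch that runs both styles against that file name. The patterns are trimmed-down variants of the ones above (only the four wanted pieces are captured), and the variable names are just for illustration.

```perl
use strict;
use warnings;

my $file = 'a b c d e [1234] (1080p).mov';

# Negated class: the first group cannot contain a space, so the match
# fails as soon as the name part has more than one word.
my @strict = $file =~ /^([^ ]+) \[(\d{4})\] \((\d+p)\)\.([^.]+)$/;

# Non-greedy: the first group grows one character at a time until the
# rest of the pattern fits, so embedded spaces are fine.
my @lazy = $file =~ /^(.+?) \[(\d{4})\] \((\d+p)\)\.([^.]+)$/;

print "negated class: ", (@strict ? join('|', @strict) : 'no match'), "\n";
print "non-greedy:    ", (@lazy   ? join('|', @lazy)   : 'no match'), "\n";
```

The first line reports no match; the second prints the four pieces separated by `|`.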

Upvotes: 3

Sinan Ünür

Reputation: 118118

See also perldoc perlreref.

Here is an updated example to take into account your sample string:

#!/usr/bin/env perl

use strict; use warnings;

my $x = q{This, could be ['a'] s(@m)pl3 file name_with any characters [1923] (720p).avi};

my $pat = qr{
    \A
    (.+?)
    [ ]
    \[ ( [0-9]{4} ) \]
    [ ]
    \( ( [0-9]+ p ) \)
    [.]
    (.+)
    \z
}x;

print "---$_---\n" for $x =~ $pat;

Output:

---This, could be ['a'] s(@m)pl3 file name_with any characters---
---1923---
---720p---
---avi---
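
If you want the pieces in named variables rather than a printed list, something along these lines should work. It reuses the pattern from the script above; the variable names ($name, $num, $res, $ext) are only illustrative, and dying on a failed match is an assumption about how you want to handle bad input.

```perl
use strict;
use warnings;

my $x = q{This, could be ['a'] s(@m)pl3 file name_with any characters [1923] (720p).avi};

# Same pattern as in the script above, written on one line.
my $pat = qr{
    \A (.+?) [ ] \[ ( [0-9]{4} ) \] [ ] \( ( [0-9]+ p ) \) [.] (.+) \z
}x;

# A failed match in list context returns an empty list, so the list
# assignment counts zero elements and the "or die" branch runs.
my ($name, $num, $res, $ext) = $x =~ $pat
    or die "no match: $x\n";

print "name=$name\nnum=$num\nres=$res\next=$ext\n";
```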

Upvotes: 3

TLP

Reputation: 67900

This looks like you are trying to parse a file name. If Sinan guessed correctly it looks something like:

$x = 'a b c d e [1234] (1080p).mov'

Now, you could write a regex to parse this, but with varying characters and a complex regex, it might be painful to maintain and easy to break. So why not make it easier and use split?

my @fields = split ' ', $x;

You can also split on a single space with / /, but then you get an empty field wherever there are consecutive spaces, and it does not strip the trailing newline.
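
A quick sketch of that difference, using a made-up input string with extra spaces and a trailing newline:

```perl
use strict;
use warnings;

my $line = "  a  b c\n";

# Special case ' ': leading whitespace is skipped and the string is
# split on runs of any whitespace, including the newline.
my @awk_style = split ' ', $line;

# Literal single space: every space is a separator, so runs of spaces
# produce empty fields and the newline stays attached to the last field.
my @single = split / /, $line;

print scalar(@awk_style), " fields: ", join('|', @awk_style), "\n";
print scalar(@single),    " fields: ", join('|', @single);
```

The first split yields three clean fields; the second yields six, three of them empty, with the newline still on the last one.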

It all depends on what fields you want to capture, of course, but since you didn't mention that, I can't help you with that. Do note that you can parse the array afterwards too:

my @nums  = grep /\d/, @fields;       # anything with numbers
my ($tag) = grep /\[\d+\]/, @fields;  # catch first [1234] type field

The point is that now regexes are easier to write and maintain.

If you are relying on doing matches from the end of the string backwards, you can make use of the reverse function in combination with split, e.g.:

my $xrev   = reverse $x;
my @fields = split ' ', $xrev, 3;

Here the 3 is a limit on how many fields are created, so @fields now contains at most three strings. Note that each of those strings is still reversed at this point.
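
To get usable values back, each field has to be reversed again. A minimal sketch, assuming the file name always ends in the bracketed and parenthesised parts; the names $tail, $tag and $name are just for illustration:

```perl
use strict;
use warnings;

my $x = 'a b c d e [1234] (1080p).mov';

# Reverse the whole string so the fixed-format fields come first.
my $xrev = reverse $x;

# Limit 3: the first two fields are split off, the rest stays in one piece.
my @fields = split ' ', $xrev, 3;

# Each field is still backwards, so reverse them individually.
my ($tail, $tag, $name) = map { scalar reverse $_ } @fields;

print "$name\n";   # a b c d e
print "$tag\n";    # [1234]
print "$tail\n";   # (1080p).mov
```

The brackets, parentheses and extension still need to be picked apart afterwards, but each of those is now a short, simple match.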

Upvotes: 0

Sylver

Reputation: 8967

I don't use Perl, so my Regex might need some tweaking, but AFAIK:

(any set of characters) = \S*
(a space) = \s+
('[' + 4 numbers + ']') = \[[0-9]{4}\]
(a space) = \s+
('(' + an unknown number of numbers + 'p)') = \([0-9]+p\)
(a period) = \.
(file extension)  = .{2,5}

Upvotes: 0

MarcoS

Reputation: 13564

I would write the regex like this: (.*?) (\[\d{4}\]) (\(\d+p\))\.(.*)

Haven't tested it, and it could be written better :)
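
For what it's worth, here is a quick way to try that pattern against the sample from the question. It is only a sketch; the variable names are mine.

```perl
use strict;
use warnings;

my $file = q{This, could be ['a'] s(@m)pl3 file name_with any characters [1923] (720p).avi};

# The pattern exactly as written in this answer.
my @parts = $file =~ /(.*?) (\[\d{4}\]) (\(\d+p\))\.(.*)/;

print "<$_>\n" for @parts;
```

It does match, but the second and third captures include the delimiters ([1923] and (720p)). To get the bare 1923 and 720p the question asks for, move the capturing parentheses inside the escaped brackets, e.g. \[(\d{4})\] and \((\d+p)\).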

Upvotes: 1
