Reputation: 1285
I'm trying to extract 4 chunks of information from a string. The string is the name of a file with the extension included. The first group can contain any valid characters until the space before the second group is reached. The second group of data will be 4 numbers contained inside a set of square brackets. That group is separated from the first group by a space. The third group could be either 3 or 4 numbers followed by the letter "p", enclosed in parentheses. This group is also separated by a space from the previous group. The last group is simply the file extension.
Here's an example:
This, could be ['a'] s(@m)pl3 file name_with any characters [1923] (720p).avi
That would then need to be parsed to be:
$1 = This, could be ['a'] s(@m)pl3 file name_with any characters
$2 = 1923
$3 = 720p
$4 = avi
Upvotes: 0
Views: 8142
Reputation: 107040
Whether Perl or not, sometimes the problem with a regular expression is its greediness. Let's say I want to capture the first name of someone and the string looked like this:
Bob Baker
I could use this regular expression:
sed 's/^\(.*\) .*$/\1/'
That would work with Bob Baker, but not with Bob Barry Baker. The problem is that my regular expression is greedy and will select all of the characters up to the last space, so I would end up not with Bob but with Bob Barry. A common way to solve this is to specify all the characters except for the one you don't want:
sed 's/^\([^ ]*\) .*$/\1/'
In this case, I am specifying any set of characters not including a space. This will change both Bob Baker and Bob Rudolph Baker to just Bob.
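The same behaviour can be reproduced in Perl. The following is only a minimal sketch; the variable names ($name, $greedy, $first) are made up for illustration and the sample string is the one used above.

```perl
use strict;
use warnings;

my $name = 'Bob Barry Baker';

# Greedy: .* takes as much as it can, so the capture runs up to the LAST space.
(my $greedy = $name) =~ s/^(.*) .*$/$1/;

# Negated class: [^ ]* cannot cross a space, so the capture stops at the FIRST space.
(my $first = $name) =~ s/^([^ ]*) .*$/$1/;

print "greedy:  $greedy\n";    # Bob Barry
print "negated: $first\n";     # Bob
```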
Perl has another way of specifying a non-greedy regular expression. In Perl, you add a ? after the quantifier of the sub-expression you want to be non-greedy. Applied to the example above, both of these will change a string containing Bob Barry Baker to just Bob:
$string =~ s/^([^ ]+) .*$/$1/;
$string =~ s/^(.+?) .*$/$1/;
By the way, these are not equivalent!
With the everything but a space regex, I could do this:
$string =~ /^([^ ]+)( )(\[\d{4}\])( )(\(\d+p\))(\.)([^.]+)/
With the non-greedy qualifier:
$string =~ /^(.+?)( )(\[\d{4}\])( )(\(\d+p\))(\.)(.*)/
And here it is with the x modifier, which lets you spread the same regular expression over multiple lines and add comments that explain what you're doing:
$string =~ /
^(.+?) #Any set of characters (non-greedy)
([ ]) #Space
(\[\d{4}\]) #[1959]
([ ]) #Space
(\([0-9]+p\)) #(430p)
[.] #Period
([^\.]+) #File Suffix (no period)
/x
And, at this point, you might as well follow Damian Conway's Best Practice recommendations on Perl regular expressions.
$string =~ /
\A #Start of string anchor
( .+? ) #Any set of characters (non-greedy)
( [ ] ) #Space
( \[ \d{4} \] ) #[1959]
( [ ] ) #Space
( \( [0-9] +p \) ) #(430p)
( [.] ) #Period
( [^\.]+ ) #File Suffix (no period)
\Z #End of string anchor
/xm;
Since x ignores white space outside of character classes (which is why a literal space is written as [ ]), I can even add spaces between subgroups on the same line. In this case, ( .+? ) is just a bit cleaner than (.+?). Whether ( \( [0-9] +p \) ) or ( \( [0-9]+p \) ) or even ( \([0-9]+p\) ) is easier to understand is up to you.
And, yes the answer looks very much like Sinan's answer.
By the way, as Sinan showed, the non-greedy qualifier is able to parse a b c d e [1234] (1080p).mov, while the "everything that isn't a space" sub-expression can't. That's why I said they're not the same.
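To make the difference concrete, here is a small sketch that runs both styles against that file name. The patterns are simplified versions of the ones above (captures moved inside the brackets), and the variable names are just illustrative.

```perl
use strict;
use warnings;

my $file = 'a b c d e [1234] (1080p).mov';

# [^ ]+ can never span a space, so the name must be a single word.
my $negated = qr/^([^ ]+) \[(\d{4})\] \((\d+p)\)\.([^.]+)$/;

# .+? starts small and grows until the rest of the pattern fits.
my $lazy = qr/^(.+?) \[(\d{4})\] \((\d+p)\)\.([^.]+)$/;

my @negated_fields = $file =~ $negated;   # empty list: no match
my @lazy_fields    = $file =~ $lazy;      # four captures

print "negated class: ", (@negated_fields ? join(' | ', @negated_fields) : 'no match'), "\n";
print "non-greedy:    ", join(' | ', @lazy_fields), "\n";
```

The negated class fails because, after matching "a" and one space, the pattern demands "[" but finds "b", and [^ ]+ has no way to extend across the space.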
Upvotes: 3
Reputation: 118118
See also perldoc perlreref.
Here is an updated example to take into account your sample string:
#!/usr/bin/env perl
use strict; use warnings;
my $x = q{This, could be ['a'] s(@m)pl3 file name_with any characters [1923] (720p).avi};
my $pat = qr{
\A
(.+?)
[ ]
\[ ( [0-9]{4} ) \]
[ ]
\( ( [0-9]+ p ) \)
[.]
(.+)
\z
}x;
print "---$_---\n" for $x =~ $pat;
Output:
---This, could be ['a'] s(@m)pl3 file name_with any characters---
---1923---
---720p---
---avi---
Upvotes: 3
Reputation: 67900
This looks like you are trying to parse a file name. If Sinan guessed correctly it looks something like:
$x = 'a b c d e [1234] (1080p).mov'
Now, you could write a regex to parse this, but with varying characters and a complex regex, it might be painful to maintain and easy to break. So why not make it easier and use split?
my @fields = split ' ', $x;
You can also split on a single space with / /, but then you risk multiple empty fields if you have multiple spaces anywhere. And, unlike ' ', it does not strip newlines.
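A quick sketch of the difference between the two forms, using a made-up input line with doubled spaces and a trailing newline:

```perl
use strict;
use warnings;

my $line = "  a  b c\n";

# ' ' is special-cased: skip leading whitespace, then split on runs of whitespace.
my @awk_style = split ' ', $line;    # ('a', 'b', 'c')

# / / splits on every single space, so adjacent spaces yield empty fields
# and the newline stays attached to the last field.
my @single = split / /, $line;       # ('', '', 'a', '', 'b', "c\n")

printf "split ' ' gives %d fields\n", scalar @awk_style;
printf "split / / gives %d fields\n", scalar @single;
```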
It all depends on what fields you want to capture, of course, but since you didn't mention that, I can't help you with that. Do note that you can parse the array afterwards too:
my @nums = grep /\d/, @fields; # anything with numbers
my ($tag) = grep /\[\d+\]/, @fields; # catch first [1234] type field
The point is that now regexes are easier to write and maintain.
If you are relying on doing matches from the end of the string backwards, you can make use of the reverse function in combination with split, e.g.:
my $xrev = reverse $x;
my @fields = split ' ', $xrev, 3;
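Here the 3 is a limit on how many fields are created, so @fields now contains only three strings. Keep in mind that the fields come back in reverse order and each one is itself spelled backwards, so apply reverse to them again before using them.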
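Put together, the reverse-and-split approach might look like the sketch below. The names $name, $tag and $res are only illustrative, and it assumes the last two space-separated tokens are always the bracketed number and the resolution plus extension.

```perl
use strict;
use warnings;

my $x = 'a b c d e [1234] (1080p).mov';

# Scalar assignment puts reverse in scalar context, so the string is reversed.
my $xrev = reverse $x;

# At most 3 fields: the first two tokens from the end, then everything else.
my @rev_fields = split ' ', $xrev, 3;

# Un-reverse each field, then put the fields back in their original order.
my ($name, $tag, $res) = reverse map { scalar reverse $_ } @rev_fields;

print "name: $name\n";    # a b c d e
print "tag:  $tag\n";     # [1234]
print "res:  $res\n";     # (1080p).mov
```

Because of the limit, the file name stays in one piece even though it contains spaces.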
Upvotes: 0
Reputation: 8967
I don't use Perl, so my Regex might need some tweaking, but AFAIK:
(any set of characters) = .+?
(a space) = \s+
('[' + 4 numbers + ']') = \[[0-9]{4}\]
(a space) = \s+
('(' + an unknown number of numbers + 'p)') = \([0-9]+p\)
(a period) = \.
(file extension) = .{2,5}
Upvotes: 0
Reputation: 13564
I would write the regex like this: (.*?) (\[\d{4}\]) (\(\d+p\))\.(.*)
Haven't tested it, and it could be written better :)
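For what it's worth, here is a small sketch that runs this pattern against the sample file name from the question. The second pattern is an assumed variation (not part of the answer) that adds anchors and moves the capture groups inside the brackets and parentheses, so the captures match the $1..$4 values the question asks for.

```perl
use strict;
use warnings;

my $file = q{This, could be ['a'] s(@m)pl3 file name_with any characters [1923] (720p).avi};

# The pattern exactly as written: captures keep the [ ] and ( ) delimiters.
my @as_written = $file =~ /(.*?) (\[\d{4}\]) (\(\d+p\))\.(.*)/;

# Assumed variation: anchored, with delimiters outside the capture groups.
my @bare = $file =~ /\A(.*?) \[(\d{4})\] \((\d+p)\)\.(.*)\z/;

print "as written: ", join(' | ', @as_written), "\n";
print "bare:       ", join(' | ', @bare), "\n";
```

So the pattern does match; it only differs from the requested output in that the second and third captures come back as [1923] and (720p) rather than 1923 and 720p.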
Upvotes: 1