Reputation: 25986
When I run the script below, I get
$VAR1 = [
'ok0.ok]][[file:ok1.ok',
undef,
undef,
'ok2.ok|dgdfg]][[file:ok3.ok',
undef,
undef,
undef,
undef,
undef,
undef,
undef,
undef,
undef,
undef,
undef,
undef,
undef
];
where I was hoping for ok0.ok ok1.ok ok2.ok ok3.ok
and ideally also ok4.ok ok5.ok ok6.ok ok7.ok
Question
Can anyone see what I am doing wrong?
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $html = "sdfgdfg[[file:ok0.ok]][[file:ok1.ok ]] [[file:ok2.ok|dgdfg]][[file:ok3.ok |dfgdfgg]] [[media:ok4.ok]] [[media:ok5.ok ]] [[media:ok6.ok|dgdfg]] [[media:ok7.ok |dfgdfgg]]ggg";
my @seen = ($html =~ /file:(.*?) |\||\]/g);
print Dumper \@seen;
Upvotes: 2
Views: 124
Reputation: 15184
This is what your regex does:
...
my $ss = qr {
    file:   # literal "file" + colon as anchor
    (       # start capture group
    .*?     # any characters, in a non-greedy sweep
    )       # end capture group
    \s      # the non-greedy sweep ends at a **white space**
  |         # OR, as a separate top-level alternative:
    \|      # => a lone | character (nothing is captured)
  |         # OR, as a separate top-level alternative:
    \]      # => a lone ] character (nothing is captured)
}x;
my @seen = $html =~ /$ss/g;
...
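To make the effect visible, here is a minimal sketch that runs the question's regex on a shortened sample string (the sample is made up for illustration, in the same format as the original data). Matches that come from the `\|` or `\]` alternatives have no capture group taking part, so they show up as undef:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened, hypothetical sample in the same format as the question's data.
my $html = '[[file:a.ok]] [[file:b.ok |x]]';

# The question's regex: three top-level alternatives, only the first has a group.
my @seen = $html =~ /file:(.*?) |\||\]/g;

# Print undef entries explicitly instead of triggering warnings.
print join(", ", map { defined $_ ? "'$_'" : "undef" } @seen), "\n";
# prints: 'a.ok]]', 'b.ok', undef, undef, undef
```

The first capture runs past the closing brackets because only a space can end it, and every later `|` or `]` adds one undef to the list.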
And this is what your regex should do:
...
my $rb = qr {
    \w :      # one word character + colon as front anchor
    (         # start capture group
    [^]| ]+   # one or more characters other than ], | or space
    )         # end capture group
}x;
my @seen = $html =~ /$rb/g;
...
If you want a short, concise regex and know what you are doing, you could drop the capturing group altogether and rely on the whole match being returned in list context, together with a positive lookbehind:
...
my @seen = $html =~ /(?<=(?:.file|media):)[^] |]+/g; # no capture group ()
...
or, if your data contains no other structure than what is shown, use the : as the only anchor:
...
my @seen = $html =~ /(?<=:)[^] |]+/g; # no capture group and short
...
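A runnable sketch of the lookbehind variant, on a shortened sample string made up for illustration. The leading `.` in `.file` looks like padding so that both alternatives are five characters long, presumably because many perl versions only accept fixed-width lookbehind; treat that reading as an assumption:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened, hypothetical sample in the same format as the question's data.
my $html = 'x[[file:ok0.ok]][[file:ok1.ok ]] [[media:ok4.ok|t]] [[media:ok5.ok |t]]y';

# No capture group: in list context, //g returns each whole match.
# The lookbehind only asserts what precedes the name; it is not part of the match.
my @seen = $html =~ /(?<=(?:.file|media):)[^] |]+/g;

print "@seen\n";
# prints: ok0.ok ok1.ok ok4.ok ok5.ok
```

Note that `.file` needs one character in front of `file`, so a `file:` at the very start of the string would not match with this pattern.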
Regards
rbo
Upvotes: 1
Reputation: 144
I hope this is what you required.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $string = "sdfgdfg[[file:ok0.ok]][[file:ok1.ok ]] [[file:ok2.ok|dgdfg]][[file:ok3.ok |dfgdfgg]] [[media:ok4.ok]] [[media:ok5.ok ]] [[media:ok6.ok|dgdfg]] [[media:ok7.ok |dfgdfgg]]ggg";
my @matches;
@matches = $string =~ m/ok\d\.ok/g;
print Dumper @matches;
Output:
$VAR1 = 'ok0.ok';
$VAR2 = 'ok1.ok';
$VAR3 = 'ok2.ok';
$VAR4 = 'ok3.ok';
$VAR5 = 'ok4.ok';
$VAR6 = 'ok5.ok';
$VAR7 = 'ok6.ok';
$VAR8 = 'ok7.ok';
Regards, Kiran.
Upvotes: 0
Reputation: 39763
It looks like you are trying to match everything starting with file:
and ending with either a space, a pipe or a closing square bracket.
The alternatives at the end of your regexp need to go inside a character class (square brackets), though:
my @seen = ($html =~ /file:(.*?)[] |]/g);
If you want the media: blocks as well, OR the file part. You might want a non-capturing group here:
my @seen = ($html =~ /(?:file|media):(.*?)[] |]/g);
The first statement will capture everything between 'file:' and the first ], | or space.
The second statement does the same, but with both file and media. We use a non-capturing group (?:group) instead of (group) so the word is not put into your @seen.
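As a quick check, here is a small script that runs both statements against the string from the question. The variable names @files and @both are only chosen for this sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The input string from the question.
my $html = 'sdfgdfg[[file:ok0.ok]][[file:ok1.ok ]] [[file:ok2.ok|dgdfg]][[file:ok3.ok |dfgdfgg]] [[media:ok4.ok]] [[media:ok5.ok ]] [[media:ok6.ok|dgdfg]] [[media:ok7.ok |dfgdfgg]]ggg';

# The terminators now sit in one character class: ], space or |.
my @files = $html =~ /file:(.*?)[] |]/g;

# Same idea, with a non-capturing group for the two prefixes.
my @both = $html =~ /(?:file|media):(.*?)[] |]/g;

print "@files\n";   # prints: ok0.ok ok1.ok ok2.ok ok3.ok
print "@both\n";    # prints: ok0.ok ok1.ok ok2.ok ok3.ok ok4.ok ok5.ok ok6.ok ok7.ok
```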
Upvotes: 1
Reputation: 126722
Depending on the possible characters in the file name, I think you probably want
my @seen = $html =~ /(?:file|media):([\w.]+)/g;
which captures all of ok0.ok through to ok7.ok.
It relies on the file names containing only alphanumeric characters, underscores and dots.
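To illustrate that restriction, a short sketch with a made-up sample: the name my-clip.ok is hypothetical and contains a hyphen, which `[\w.]` does not cover, so the capture stops early:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: the second name contains a hyphen.
my $html = '[[file:ok0.ok]] [[media:my-clip.ok |t]]';

# [\w.]+ only accepts letters, digits, underscore and dot.
my @seen = $html =~ /(?:file|media):([\w.]+)/g;

print "@seen\n";
# prints: ok0.ok my
```

If such names can occur, the character class would have to be widened accordingly, for example to `[\w.-]`.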
Upvotes: 0
Reputation: 13942
A negated character class can simplify things a bit, I think. Be explicit as to your anchors (file:, or media:), and explicit as to what terminates the sequence (a space, pipe, or closing bracket). Then capture.
my @seen = $html =~ m{(?:file|media):([^\|\s\]]+)}g;
Explained:
my @seen = $html =~ m{
    (?:file|media):   # Match either 'file' or 'media' without capturing, then ':'
    ( [^\|\s\]]+ )    # Capture one or more characters other than |, whitespace or ]
}gx;
Capturing stops as soon as ], |, or \s is encountered.
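Here is a minimal sketch that wraps the pattern in a helper so it can be reused. The sub name extract_names and the sample string are assumptions made for illustration; because the class is negated, names with characters such as a hyphen are captured whole:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns every name following 'file:' or 'media:'.
# Meant to be called in list context.
sub extract_names {
    my ($text) = @_;
    return $text =~ m{(?:file|media):([^\|\s\]]+)}g;
}

# Made-up sample in the same format as the question's data.
my @seen = extract_names('[[file:ok0.ok]] [[media:my-clip.ok |t]] [[file:a_b.ok|t]]');

print "@seen\n";
# prints: ok0.ok my-clip.ok a_b.ok
```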
Upvotes: 2