Reputation: 33
I'm looking to come up with a pattern to match this:
(words words words words) | 1234.5678% | (1234)
where I'd like to preserve (words words words words) as $1 and (1234) as $2.
The input file looks like this:
Header Crap | More Header Crap|Header Crap | More Header Crap|(words words words words) | 1234.5678% | (1234) | (words words words words) | 1234.5678% | (1234)(words words words words) | 1234.5678% | (1234) | (words words words words) | 1234.5678% | (1234)(words words words words) | 1234.5678% | (1234) | (words words words words) | 1234.5678% | (1234) | (words words words words) | 1234.5678% | (1234) | (words words words words) | 1234.5678% | (1234)
I believe the issue has something to do with the input. It comes in as one big blob (i.e. $_ is one big string of data that needs to be parsed to find the matches).
Things I've tried:
while ($_ =~ /(.*)\|{1}\d*?\.{1}\d*?%{1}\|{1}(\d*)/) {
    # do stuff with $1 and $2
}
as well as
@matches = $_ =~ /(.*)\|{1}\d*?\.{1}\d*?%{1}\|{1}(\d*)/;
And a whole bunch of other similar variations on both of these. I'm just looking for some guidance in the right direction. Any help would be greatly appreciated!
Upvotes: 2
Views: 103
Reputation: 33
Turns out the regex wasn't really the issue. binmode seems to be the answer. I was going from a Linux to a Windows environment (my fault for not mentioning this above :( ) and needed to deal with the weird line-endings issue. Here is essentially what I ended up using:
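if (open FILE1, $_) {
    binmode(FILE1);    # binmode takes the filehandle, not the filename in $_
    @file = <FILE1>;
    close FILE1;
    foreach (@file) {
        # $1 is the first field, $2 the third; the middle field is skipped
        if ($_ =~ /(.*?)\|.*?\|(.*?)\|\n/g) {
            print "$1\n $2\n";
        }
    }
}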
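If touching the I/O layer isn't convenient, a rough alternative is to strip the carriage returns from the data before matching. This is only a sketch and not the code used above: it assumes the mismatch comes from stray \r characters, and the helper name normalize_eol and the sample record are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn CRLF and lone CR line endings into "\n".
sub normalize_eol {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;
    return $text;
}

# Made-up sample record with a Windows-style line ending.
my $line = normalize_eol("(words words) | 12.5% | (1234) |\r\n");

# Same shape of pattern as above; without the normalization the "\r"
# sitting between the last "|" and "\n" would make this match fail.
if ($line =~ /(.*?)\|.*?\|(.*?)\|\n/) {
    print "$1\n$2\n";
}
```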
Thanks for all the help!
Upvotes: 1
Reputation: 48599
use strict;
use warnings;
use 5.014;
my $str = <<END_OF_STRING;
Header Crap | More Header Crap|Header Crap | More Header
Crap|(words words 1 words words) | 1234.5678% | (1234 1) |
(words words 2 words words) | 1234.5678% |(1234 2)(words words 3 words words)
| 1234.5678% | (1234 3) | (words words 4 words words) |
1234.5678% | (1234 4)(words words 5 words words) |
1234.5678% | (1234 5) | (words words 6 words words) | 1234.5678% |
(1234 6) | (words words 7 words words) | 1234.5678% | (1234 7) |
(words words 8 words words) | 1234.5678% | (1234 8)
END_OF_STRING
my $paren_clause = <<END_OF_CLAUSE;
(
[(] #An opening parenthesis
[^)]+ #followed by not a closing parenthesis, one or more times
[)] #followed by a closing parenthesis.
)
END_OF_CLAUSE
my $not_paren_clause = "[^(]+"; #Not an opening parenthesis, one or more times
my $pattern = <<END_OF_PATTERN;
$paren_clause
$not_paren_clause
$paren_clause
END_OF_PATTERN
while ($str =~ /$pattern/xmsg) {
say "$1 $2";
}
--output:--
(words words 1 words words) (1234 1)
(words words 2 words words) (1234 2)
(words words 3 words words) (1234 3)
(words words 4 words words) (1234 4)
(words words 5 words words) (1234 5)
(words words 6 words words) (1234 6)
(words words 7 words words) (1234 7)
(words words 8 words words) (1234 8)
Upvotes: 0
Reputation: 11452
Text::CSV is often easier for parsing delimited fields of that sort.
Like this, for example:
use Text::CSV;
use String::Util 'trim';
my $csv = Text::CSV->new({
sep_char => '|'
});
$csv->parse('(words words words words) | 1234.5678% | (1234)');
foreach ($csv->fields) {
my $field = trim $_;
print "$field\n";
}
Upvotes: 1
Reputation: 89557
You can use this pattern:
/(\(\w+ \w+ \w+ \w+\)) *\| *\d+(?:\.\d+)?% *\| *(\(\d+\))/
This pattern accepts any number of spaces around the pipe |.
For a more general pattern, you can replace the four \w+ by [^)]+:
/(\([^)]+\)) *\| *\d+(?:\.\d+)?% *\| *(\(\d+\))/
Example:
#!/usr/bin/perl
use strict;
my $string = 'Header Crap | More Header Crap|Header Crap | More Header Crap|(words words words words) | 1234.5678% | (1234) | (words words words words) | 1234.5678% | (1234)(words words words words) | 1234.5678% | (1234) | (words words words words) | 1234.5678% | (1234)(words words words words) | 1234.5678% | (1234) | (words words words words) | 1234.5678% | (1234) | (words words words words) | 1234.5678% | (1234) | (words words words words) | 1234.5678% | (1234)';
while($string =~ /(\([^)]+\)) *\| *\d+(?:\.\d+)?% *\| *(\(\d+\))/g) {
print $1 . " " . $2 . "\n";
}
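The question also tried assigning the match to an array. As a sketch of how that could work with the same pattern: in list context with /g, the match returns all captures as one flat list, which can then be regrouped two at a time. The helper name extract_pairs and the shortened sample string are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return every [text, number] pair found in the blob.
sub extract_pairs {
    my ($blob) = @_;

    # List context + /g: a flat list ($1, $2, $1, $2, ...) over all matches.
    my @flat = $blob =~ /(\([^)]+\)) *\| *\d+(?:\.\d+)?% *\| *(\(\d+\))/g;

    # Regroup the flat list into pairs.
    my @pairs;
    while (my ($text, $num) = splice(@flat, 0, 2)) {
        push @pairs, [$text, $num];
    }
    return @pairs;
}

# Shortened, made-up sample in the same shape as the input.
my $sample = 'Header | (a b c d) | 12.5% | (1) | (e f g h) | 3% | (2)(i j) | 4.0% | (3)';

for my $pair (extract_pairs($sample)) {
    print "$pair->[0] $pair->[1]\n";
}
```

This trades the while loop for a single pass that keeps all results in memory, which is fine for a blob that already fits in one string.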
Upvotes: 0
Reputation: 30273
Use a non-greedy quantifier here:
while ($_ =~ /(.*?)\|{1}\d*?\.{1}\d*?%{1}\|{1}(\d*)/) {
                 ^
I can't tell whether your parentheses are literal or what, but if literal, you need to escape them:
while ($_ =~ /(\(.*?\))\|{1}\d*?\.{1}\d*?%{1}\|{1}(\(\d*\))/) {
               ^^   ^^                             ^^   ^^
And as @Tim mentioned, there's no need for the {1} quantifier (reverting literal parentheses):
while ($_ =~ /(.*?)\|\d*?\.\d*?%\|(\d*)/) {
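To see the difference the ? makes, here is a small sketch that runs the greedy and non-greedy versions over the same string. The helper name pairs_for and the sample are made up for illustration, and the sample deliberately has no spaces around the pipes, because these patterns (as written) don't allow any.

```perl
use strict;
use warnings;

# Hypothetical helper: apply a pattern repeatedly and return "first:second"
# strings built from the two capture groups of each match.
sub pairs_for {
    my ($re, $str) = @_;
    my @out;
    while ($str =~ /$re/g) {
        push @out, join(':', $1, $2);
    }
    return @out;
}

# Two records back to back, no spaces around the pipes.
my $sample = 'aaa|1.5%|11bbb|2.5%|22';

my @greedy = pairs_for(qr/(.*)\|\d*?\.\d*?%\|(\d*)/,  $sample);
my @lazy   = pairs_for(qr/(.*?)\|\d*?\.\d*?%\|(\d*)/, $sample);

print "greedy: @greedy\n";
print "lazy:   @lazy\n";
```

The greedy version finds a single match whose first capture swallows everything up to the last record, while the non-greedy one stops at each record in turn.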
Upvotes: 1