epiclairs

Reputation: 33

Rather specific Perl Regex

I'm looking to come up with a pattern to match this:

(words words words words) | 1234.5678% | (1234)

where I'd like to preserve (words words words words) as $1 and (1234) as $2.

The input file looks like this:

Header Crap | More Header Crap|Header Crap | More Header Crap|(words words words words) | 1234.5678% | (1234) | (words words words words) | 1234.5678%        |   (1234)(words words words words) | 1234.5678% | (1234) | (words words words words) |   1234.5678% | (1234)(words words words words) | 1234.5678% | (1234) | (words words words words) | 1234.5678% | (1234) | (words words words words) | 1234.5678% | (1234) | (words words words words) | 1234.5678% | (1234)

I believe the issue has something to do with the input. It comes in as one big blob (i.e. $_ is one big string of data that needs to be parsed to find the matches).

Things I've tried:

while ($_ =~ /(.*)\|{1}\d*?\.{1}\d*?%{1}\|{1}(\d*)/) {
    # do stuff with $1 and $2
}

as well as

@matches = $_ =~ /(.*)\|{1}\d*?\.{1}\d*?%{1}\|{1}(\d*)/;

And a whole bunch of other similar variations on both of these. I'm just looking for some guidance in the right direction. Any help would be greatly appreciated!

Upvotes: 2

Views: 103

Answers (5)

epiclairs

Reputation: 33

It turns out the regex wasn't really the issue; binmode seems to be the answer. I was moving from a Linux to a Windows environment (my fault for not mentioning this above :( ) and needed to deal with the differing line endings. Here is essentially what I ended up using:

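if (open FILE1, $_) {
    binmode(FILE1);    # binmode takes the filehandle, not the filename held in $_
    @file = <FILE1>;
    close FILE1;
    foreach (@file) {
        # capture the first and third pipe-delimited fields of each line
        if ($_ =~ /(.*?)\|.*?\|(.*?)\|\n/g) {
            print "$1\n $2\n";
        }
    }
}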
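
If stray carriage returns are the real culprit, another option is to strip them explicitly so the parsing no longer depends on how the file was opened. This is only a minimal sketch: the sample data, the variable names and the assumption of one record per line are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample: one record per line, with Windows-style CRLF endings
my $data = "(words a) | 12.5% | (1)\r\n(words b) | 7.25% | (2)\r\n";

my @pairs;
for my $line (split /\n/, $data) {
    $line =~ s/\r\z//;    # drop a trailing carriage return, if there is one

    # first field, skipped middle field, last field (surrounding spaces trimmed)
    if ($line =~ /^\s*(.*?)\s*\|.*?\|\s*(.*?)\s*$/) {
        push @pairs, [$1, $2];
        print "$1\n$2\n";
    }
}
```

The same substitution works on lines read with <FILE1>, after which the pattern only ever sees the record text.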

Thanks for all the help!

Upvotes: 1

7stud

Reputation: 48599

use strict;
use warnings;
use 5.014;  

my $str = <<END_OF_STRING;
Header Crap | More Header Crap|Header Crap | More Header
Crap|(words words 1 words words) | 1234.5678% | (1234 1) | 
(words words 2 words words) | 1234.5678% |(1234 2)(words words 3 words words) 
| 1234.5678% | (1234 3) | (words words 4 words words) |  
1234.5678% | (1234 4)(words words 5 words words) | 
1234.5678% | (1234 5) | (words words 6 words words) | 1234.5678% |
(1234 6) | (words words 7 words words) | 1234.5678% | (1234 7) | 
(words words 8 words words) | 1234.5678% | (1234 8)
END_OF_STRING

my $paren_clause = <<END_OF_CLAUSE;
(
    [(]     #An opening parenthesis
    [^)]+   #followed by not a closing parenthesis, one or more times
    [)]     #followed by a closing parenthesis.
)
END_OF_CLAUSE

my $not_paren_clause = "[^(]+";  #Not an opening parenthesis, one or more times

my $pattern = <<END_OF_PATTERN;
    $paren_clause 
    $not_paren_clause
    $paren_clause
END_OF_PATTERN

while ($str =~ /$pattern/xmsg) {
    say "$1 $2";
}

--output:--
(words words 1 words words) (1234 1)
(words words 2 words words) (1234 2)
(words words 3 words words) (1234 3)
(words words 4 words words) (1234 4)
(words words 5 words words) (1234 5)
(words words 6 words words) (1234 6)
(words words 7 words words) (1234 7)
(words words 8 words words) (1234 8)
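
The same idea can probably be written with qr//x instead of heredocs, which keeps the commented sub-patterns as compiled regex objects. This is a sketch of that variation, not the code above; the shortened sample string is an assumption made for brevity.

```perl
use strict;
use warnings;

my $paren_clause = qr{
    (
        [(]      # an opening parenthesis
        [^)]+    # one or more characters that are not a closing parenthesis
        [)]      # a closing parenthesis
    )
}x;

# parenthesised clause, anything up to the next opening parenthesis, second clause
my $pattern = qr{ $paren_clause [^(]+ $paren_clause }x;

# Hypothetical shortened input
my $str = 'Header | (one two) | 12.5% | (1)(three four) | 7.25% | (2)';

my @found;
while ($str =~ /$pattern/g) {
    push @found, "$1 $2";
    print "$1 $2\n";
}
```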

Upvotes: 0

rutter

Reputation: 11452

Text::CSV is often easier for parsing delimited fields of that sort.

Like this, for example:

use Text::CSV;
use String::Util 'trim';

my $csv = Text::CSV->new({
    sep_char => '|'
});

$csv->parse('(words words words words) | 1234.5678% | (1234)');
foreach ($csv->fields) {
    my $field = trim $_;
    print "$field\n";
}
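
If installing modules is not an option, a plain split on the pipe may be enough for simple input like this. It is only a sketch and assumes that no field ever contains a literal or quoted pipe, which is exactly the kind of case Text::CSV handles for you.

```perl
use strict;
use warnings;

my $record = '(words words words words) | 1234.5678% | (1234)';

# Split on a pipe together with any whitespace around it
my @fields = split /\s*\|\s*/, $record;

print "$_\n" for @fields;
```

Note that this does not trim whitespace before the first field or after the last one; the sample record has none.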

Upvotes: 1

Casimir et Hippolyte

Reputation: 89557

You can use this pattern:

/(\(\w+ \w+ \w+ \w+\)) *\| *\d+(?:\.\d+)?% *\| *(\(\d+\))/

This pattern allows any number of spaces on either side of each pipe |.

For a more general pattern, you can replace the four \w+ with [^)]+:

/(\([^)]+\)) *\| *\d+(?:\.\d+)?% *\| *(\(\d+\))/

Example:

#!/usr/bin/perl

use strict;

my $string = 'Header Crap | More Header Crap|Header Crap | More Header Crap|(words words words words) | 1234.5678% | (1234) | (words words words words) | 1234.5678%        |   (1234)(words words words words) | 1234.5678% | (1234) | (words words words words) |   1234.5678% | (1234)(words words words words) | 1234.5678% | (1234) | (words words words words) | 1234.5678% | (1234) | (words words words words) | 1234.5678% | (1234) | (words words words words) | 1234.5678% | (1234)';

while($string =~ /(\([^)]+\)) *\| *\d+(?:\.\d+)?% *\| *(\(\d+\))/g) {
    print $1 . " " . $2 . "\n";
}
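
If you would rather collect everything first, which is what the @matches attempt in the question was going for, a match with /g in list context returns every capture from every match as one flat list. A minimal sketch with a made-up, shortened input string:

```perl
use strict;
use warnings;

# Hypothetical shortened input
my $string = 'Header|(a b c d) | 1.5% | (10)(e f g h) |   22.75%   |   (20)';

# In list context, /g returns all captures of all matches: ($1, $2, $1, $2, ...)
my @captures = $string =~ /(\([^)]+\)) *\| *\d+(?:\.\d+)?% *\| *(\(\d+\))/g;

# Walk the flat list two elements at a time
for (my $i = 0; $i < @captures; $i += 2) {
    print "$captures[$i] $captures[$i + 1]\n";
}
```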

Upvotes: 0

Andrew Cheong

Reputation: 30273

Use a non-greedy quantifier here:

while ($_ =~ /(.*?)\|{1}\d*?\.{1}\d*?%{1}\|{1}(\d*)/) {
                 ^

I can't tell whether your parentheses are literal or what, but if literal, you need to escape them:

while ($_ =~ /(\(.*?\))\|{1}\d*?\.{1}\d*?%{1}\|{1}(\(\d*\))/) {
               ^^   ^^                              ^^  ^^

And as @Tim mentioned, there's no need for the {1} quantifier (reverting literal parentheses):

while ($_ =~ /(.*?)\|\d*?\.\d*?%\|(\d*)/) {
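
To see what the non-greedy quantifier changes, here is a small sketch on a made-up string with no spaces around the pipes, which is what these patterns expect:

```perl
use strict;
use warnings;

# Hypothetical input: two records run together, no spaces around the pipes
my $input = 'aaa|1.5%|10bbb|2.5%|20';

# Only the first capture group is kept from each match
my ($greedy) = $input =~ /(.*)\|\d*?\.\d*?%\|(\d*)/;
my ($lazy)   = $input =~ /(.*?)\|\d*?\.\d*?%\|(\d*)/;

print "greedy: $greedy\n";    # greedy: aaa|1.5%|10bbb
print "lazy:   $lazy\n";      # lazy:   aaa
```

The greedy (.*) swallows the whole first record and only gives back enough to match the last one, so a single big string yields just one match.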

Upvotes: 1
