tonguim


[Perl]: Read directory and files, and regex

From this string, (champs1 (champs6 donnee_o donnee_f) [(champs2 [] (champs3 _YOJNJeyyyyyyB (champs4 donnee_x)) (debut 144825 25345) (fin 244102 40647)), (champs2 [] (champs3 _FuGNJeyyyyyyB (champs4 donnee_z)) (debut 796443 190570) (fin 145247 42663))] [] [])., I would like to extract the first number after the word "debut" and the first number after the word "fin". I wrote this:

while (my $readfile = <FILE>) #read each line and check the first value X1 after the word "coorDeb" and the first value X2 after the word "coorFin"
{
    my ($line) = $_;
    chomp ($line);

    ($first, $second)= ~m/coorDeb/\s\S*\s\S*\s\S*\s\S*\s\S*; #CoorDeb first, following by X1

    $X1=$first; $X4=$second;
    $lenght1=$second-$first; # Calculation of the lenght of first segment

    $line  =~ m//coorFin/(\s*)\S*\s*\S*\s*\S*\s*\S*\s*(\S*/); #CoorFin first, following by X1
    $lenght2=$second-$first; # Calculation of the lenght of first segment

    push(@elements1, $lenght1); #Push the lenght into a table to compute the mean of lenght for the segment n°1
    push(@elements2, $lenght2); #Push the lenght into a table to compute the mean of lenght for the segment n°2
}

Can anyone help me with the regex please? Thank you.

Upvotes: 2

Views: 1429

Answers (2)

Nic Gibson

Reputation: 7143

If I have understood correctly, you simply need to read a file, and find two values. These values are the series of digits after the word 'fin' and after the word 'debut'. Right now, you are trying to match on these by looking for something that occurs before the string you are interested in. Perhaps you should be looking for the actual information of interest.

In a regular expression, it is almost always better to look for interesting text rather than try to skip non-interesting text. Something like the following will work better.

Note that I've changed your file read, because you were reading each line into a variable and then processing $_, which is almost certainly not what you meant.

while (my $line = <FILE>) #read each line from FILE.
{
    chomp ($line);

    # These two lines could be combined but this is a little clearer.
    # Matching against [0-9] because \d matches all Unicode digits.
    my ($fin_digits) = $line =~ /fin\s+([0-9]+)/;   
    my ($debut_digits) = $line =~ /debut\s+([0-9]+)/; # as above.

    # Continue processing below...
}
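
To try this without an input file, here is a minimal sketch that runs the same two expressions on one line. The sub name first_debut_fin and the shortened sample string are made up for illustration; they are not from your code.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first number after "debut" and the
# first number after "fin" in a line, or undef for whichever is missing.
sub first_debut_fin {
    my ($line) = @_;
    my ($debut_digits) = $line =~ /debut\s+([0-9]+)/;
    my ($fin_digits)   = $line =~ /fin\s+([0-9]+)/;
    return ($debut_digits, $fin_digits);
}

# Shortened sample based on the line in the question.
my $sample = '(champs2 [] (debut 144825 25345) (fin 244102 40647)), '
           . '(champs2 [] (debut 796443 190570) (fin 145247 42663))';

my ($debut, $fin) = first_debut_fin($sample);
print "debut=$debut fin=$fin\n";   # prints "debut=144825 fin=244102"
```

Only the first debut and the first fin on the line are reported; anything after them on the same line is ignored.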

Now, one difference is that your example data shows multiple occurrences of fin and debut in one line. If that is the case, you will need a slightly different regular expression. Let us all know if that really is the case.

UPDATE

Given that you do have matching pairs on the same line, you might want to use something like the following. Again, I've only put in the regular expression matching and not the processing code. This code allows for an arbitrary number of pairs on a single line.

while (my $line = <FILE>) #read each line from FILE.
{
    chomp ($line);

    # Matching against [0-9] because \d matches all Unicode digits.
    # In scalar context (the while condition), m//g finds one match per
    # iteration and resumes where the previous match ended, so each
    # debut/fin pair on the line is visited in order.
    while ($line =~ /debut\s+([0-9]+).+?fin\s+([0-9]+)/g)
    {
        my ($debut, $fin) = ($1, $2);
        # result processing here.
    }


}
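
One way to run a per-pair loop on a sample line is sketched below. It evaluates the match against the line in scalar context, so that /g advances one pair per iteration. The sub name debut_fin_pairs and the shortened sample string are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical helper: collects every debut/fin pair found on one line
# as a list of [debut, fin] array references.
sub debut_fin_pairs {
    my ($line) = @_;
    my @pairs;
    # In scalar context, each m//g evaluation continues after the
    # previous match, so the loop ends when no further pair is found.
    while ($line =~ /debut\s+([0-9]+).+?fin\s+([0-9]+)/g) {
        push @pairs, [$1, $2];
    }
    return @pairs;
}

# Shortened sample based on the line in the question.
my $sample = '(champs2 [] (debut 144825 25345) (fin 244102 40647)), '
           . '(champs2 [] (debut 796443 190570) (fin 145247 42663))';

for my $pair (debut_fin_pairs($sample)) {
    my ($debut, $fin) = @$pair;
    print "debut=$debut fin=$fin\n";
}
# prints:
# debut=144825 fin=244102
# debut=796443 fin=145247
```

Keeping each pair together in its own array reference makes it easy to do per-pair arithmetic later, if that is what you need.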

Upvotes: 0

Dave Sherohman

Reputation: 46245

You're making this way too complicated by trying to count fields and calculate offsets in the line and so forth. Assuming you're looking for matched debut/fin pairs, you can use

#!/usr/bin/perl

use strict;
use warnings;

my @elements;
while (<DATA>) {
  my $line = $_;
  push @elements, $line =~ /debut (\d+).*?fin (\d+)/g;
}

print join ',', @elements;
print "\n";
__DATA__
(champs1 (champs6 donnee_o donnee_f) [(champs2 [] (champs3 _YOJNJeyyyyyyB (champs4 donnee_x)) (debut 144825 25345) (fin 244102 40647)), (champs2 [] (champs3 _FuGNJeyyyyyyB (champs4 donnee_z)) (debut 796443 190570) (fin 145247 42663))] [] [])

This code generates the output

144825,244102,796443,145247

($line isn't even really needed, since m// operates on $_ by default, but I left that in there in case you actually need to do other processing on it. And push @elements, /debut (\d+).*?fin (\d+)/g; is a little more obfuscated than I feel is appropriate here.)

If you're not concerned with matching pairs, you can also use two separate arrays and replace the push line with

push @debuts, $line =~ /debut (\d+)/g;
push @fins, $line =~ /fin (\d+)/g;
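
As a rough, self-contained sketch of the two-array variant: the sub name collect_debuts_fins is hypothetical, and the second sample line (a debut with no fin) is invented to show that the two arrays can end up with different lengths.

```perl
use strict;
use warnings;

# Hypothetical helper: gathers all debut numbers and all fin numbers
# independently, without trying to pair them up.
sub collect_debuts_fins {
    my (@lines) = @_;
    my (@debuts, @fins);
    for my $line (@lines) {
        push @debuts, $line =~ /debut (\d+)/g;
        push @fins,   $line =~ /fin (\d+)/g;
    }
    return (\@debuts, \@fins);
}

my @lines = (
    # Shortened sample based on the line in the question.
    '(debut 144825 25345) (fin 244102 40647)), (debut 796443 190570) (fin 145247 42663)',
    # Invented line with a debut but no fin.
    '(debut 500 10)',
);

my ($debuts, $fins) = collect_debuts_fins(@lines);
print "debuts: @$debuts\n";   # prints "debuts: 144825 796443 500"
print "fins: @$fins\n";       # prints "fins: 244102 145247"
```

Because the arrays are filled independently, the element at a given index in @debuts is not guaranteed to belong with the element at the same index in @fins.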

Upvotes: 4
