Terry

Reputation: 1232

How to capture multiple words using regex on this particular text?

I'm trying to extract the best paying job titles from this sample text:

Data Scientist

#1 in Best Paying Jobs

5,100  Projected Jobs $250,000 Median Salary 0.5% Unemployment Rate

Programmer

#2 in Best Paying Jobs

4,000 Projected Jobs $240,000 Median Salary 1.0% Unemployment Rate

SAP Module Consultant

#3 in Best Paying Jobs

3,000 Projected Jobs $220,000 Median Salary 0.2% Unemployment Rate

by using the following regex and Perl code.

use File::Glob;
local $/ = undef;
my $file = @ARGV[0];

open INPUT, "<", $file
    or die "Couldn't open file $!\n";

my $content = <INPUT>;

my $regex = "^\w+(\w+)*$\n\n#(\d+)";

my @arr_found = ($content =~ m/^\w+(\w+)*$\n\n#(\d+)/g);

close (INPUT);

Q1: The regex finds only the one-word titles*. How can I make it find the multi-word titles, and how do I properly capture the found titles into a Perl array?

Q2: I stored the regex in a Perl variable and tried to use that variable in the match, like this:

my @arr_found = ($content =~ m/"$regex"/g);

but it gave an error. How can I make this work?

* When I apply the regex ^\w+(\w+)*$\n\n#(\d+) in Sublime Text 2, it finds only the one-word titles.

Upvotes: 3

Views: 1309

Answers (3)

zdim

Reputation: 66873

Why not process the file line by line? It is simple and easy:

use warnings;
use strict;
use feature 'say';

my $file = shift || die "Usage: $0 file\n";

open my $fh, '<', $file  or die "Can't open $file: $!";

my (@jobs, $prev_line);

while (my $line = <$fh>) { 
    chomp $line;
    next if not $line =~ /\S/;

    if ($line =~ /^\s*#[0-9]/) {
        push @jobs, $prev_line;
    }   

    $prev_line = $line;
}

say for @jobs;

This relies on the requirement that the #N line is the first non-empty line after the job title.

It prints

Data Scientist
Programmer
SAP Module Consultant

The question doesn't say whether rankings are wanted as well but there is a hint in the regex that they may be. Then, assuming that the ordering in the file is "correct" you can iterate over the array indices and print elements (titles) with their indices (rank).
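A minimal sketch of the index-as-rank idea, assuming @jobs has already been filled by the loop above and that the file lists the titles in rank order (@ranked is a name introduced here only for illustration):

```perl
use strict;
use warnings;
use feature 'say';

# Assumption: @jobs holds the titles in file order, as collected by the loop above.
my @jobs = ('Data Scientist', 'Programmer', 'SAP Module Consultant');

# Array indices start at 0, so the rank is index + 1.
my @ranked = map { '#' . ($_ + 1) . " $jobs[$_]" } 0 .. $#jobs;

say for @ranked;
```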

Or, to be certain, capture them in the regex, /^\s*#([0-9]+)/. Then you can directly print both the title and its rank, or perhaps store them in a hash with key-value pairs rank => title.
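Capturing the rank in the line-by-line loop could look roughly like this. It is a sketch, not a drop-in replacement: the sample text is read from a string through an in-memory filehandle so the example is self-contained, and %title_for is a hypothetical name for the rank => title hash.

```perl
use strict;
use warnings;
use feature 'say';

# Sample text from the question, kept in a string (single-quoted heredoc,
# so the $ signs in the salaries are not interpolated).
my $text = <<'END';
Data Scientist

#1 in Best Paying Jobs

5,100  Projected Jobs $250,000 Median Salary 0.5% Unemployment Rate

Programmer

#2 in Best Paying Jobs

4,000 Projected Jobs $240,000 Median Salary 1.0% Unemployment Rate

SAP Module Consultant

#3 in Best Paying Jobs

3,000 Projected Jobs $220,000 Median Salary 0.2% Unemployment Rate
END

# Read the string through an in-memory filehandle, as if it were the file.
open my $fh, '<', \$text or die "Can't open in-memory handle: $!";

my (%title_for, $prev_line);    # %title_for maps rank => title

while (my $line = <$fh>) {
    chomp $line;
    next if $line !~ /\S/;                # skip blank lines

    if ($line =~ /^\s*#([0-9]+)/) {       # a "#N ..." line; $1 is the rank
        $title_for{$1} = $prev_line;      # the previous non-blank line is the title
    }
    $prev_line = $line;
}
close $fh;

say "#$_ $title_for{$_}" for sort { $a <=> $b } keys %title_for;
```

Keying the hash by rank means the printed order no longer depends on the order of the lines in the file.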


As for the regex, a few corrections are needed. To compose a regex ahead of matching, which is a great idea, you want the qr operator. To work with multi-line strings you need the /m modifier (see perlretut). The regex itself also needs fixing. For example

my $regex  = qr/^(.+)?(?:\n\s*)+\n\s*#\s*[0-9]/m;
my @titles = $content =~ /$regex/g;

which captures a line followed by at least one empty line and then #N on another line.
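To see what the /m modifier changes, here is a small sketch on a shortened sample. The pattern is deliberately simplified compared to the one above, and the variable names are made up for the illustration:

```perl
use strict;
use warnings;
use feature 'say';

# Shortened sample: two titles, each followed by an empty line and a #N line.
my $content = "Data Scientist\n\n#1 in Best Paying Jobs\n\n"
            . "Programmer\n\n#2 in Best Paying Jobs\n";

my $without_m = qr/^(.+)\n\n#[0-9]/;     # ^ anchors only at the start of the string
my $with_m    = qr/^(.+)\n\n#[0-9]/m;    # ^ anchors at the start of every line

my @first_only = $content =~ /$without_m/g;
my @all        = $content =~ /$with_m/g;

say 'without /m: ', join(', ', @first_only);
say 'with /m:    ', join(', ', @all);
```

Without /m only the title on the very first line can match, because ^ never matches after an embedded newline. The flags given to qr travel with the compiled pattern, so /$with_m/g does not need to repeat the m.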

If the ranking of titles is needed as well then capture it, too, and store in a hash

my $regex = qr/^(.+)?(?:\n\s*)+\n\s*#\s*([0-9]+)/m;
my %jobs  = reverse  $content =~ /$regex/g;

or, perhaps better, don't push your luck by reversing the list of matches, and iterate through the pairs instead

my %jobs;
while ($content =~ /$regex/g) {
    $jobs{$2} = $1;
}

since with this we can check our "catch" at each iteration, do other processing, etc. Then you can sort the keys to print in order

say "#$_ $jobs{$_}" for sort { $a <=> $b } keys %jobs;

and just in general pick jobs by their rank as needed.

I think that it's fair to say that the regex here is much more complex than the first program.

Upvotes: 4

Stefan Becker

Reputation: 5952

Answers for your questions:

  1. You are capturing the second word only, and you do not allow for whitespace between the words. That's why it won't match e.g. Data Scientist.

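  2. Use the qr// operator to build a regex ahead of time. A double-quoted string is the wrong container for a pattern: the string parser handles the backslash escapes itself, so \w and \d lose their backslashes, and the $\ in $\n is interpolated as Perl's output record separator variable. Perl interpolates $\ inside m// as well, so a $ directly followed by a backslash is best avoided in a pattern. In addition, the quotes in m/"$regex"/ are matched as literal characters.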
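A short sketch of the difference between a double-quoted string and qr//, using a one-title sample. The variable names are invented for the illustration, and the `no warnings` block is there only to keep the demonstration quiet:

```perl
use strict;
use warnings;
use feature 'say';

my $text = "Data Scientist\n\n#1 in Best Paying Jobs\n";

# Double quotes: Perl processes the escapes and interpolates variables before
# the regex engine ever sees the text, so the pattern is not what was typed.
my $mangled;
{
    no warnings;    # demo only: hide the warnings this string would raise
    $mangled = "^\w+(\w+)*$\n\n#(\d+)";
}
say 'pattern changed by double quotes: ',
    ($mangled eq '^\w+(\w+)*$\n\n#(\d+)' ? 'no' : 'yes');

# qr// compiles the pattern as written; interpolate it without extra quotes.
my $regex = qr/^(\w+(?:\s+\w+)*)\n\n#(\d+)/m;
my ($title, $rank) = $text =~ /$regex/;

say "title='$title' rank='$rank'";
```

Running the original double-quoted assignment under `use warnings` is a quick way to spot this kind of problem, since Perl reports the escapes it does not recognise in a string.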

The following code should achieve what you want. Note the two-step approach:

  1. Find matching text

    • beginning of a line (^)
    • one-or-more words separated by white space (\w+(?:\s+\w+)*, no need to capture match)
    • 2 line ends (\n\n)
    • # followed by a number (\d+)
    • apply regex multiple times (/g) and treat strings as multiple lines (/m, i.e. ^ will match any beginning of a line in the input text)
  2. Split match at line ends (\n) and extract the 1st and the 3rd field

    • As we know $match will contain three lines, this approach is much easier than writing another regex.

#!/usr/bin/perl
use strict;
use warnings;

use feature qw(say);
use File::Slurper qw(read_text);

# read_text() croaks on failure by itself, so no "or die" is needed
my $input = read_text($ARGV[0]);

my $regex = qr/^(\w+(?:\s+\w+)*\n\n#\d+)/m;

foreach my $match ($input =~ /$regex/g) {
    #say $match;
    my($title, undef, $rank) = split("\n", $match);
    $rank =~ s/^#//;
    say "MATCH '${title}' '${rank}'";
}

exit 0;

Test run over the example text you provided in your question.

$ perl dummy.pl dummy.txt
MATCH 'Data Scientist' '1'
MATCH 'Programmer' '2'
MATCH 'SAP Module Consultant' '3'

UNICODE UPDATE: as suggested by @Jan's answer the code can be improved like this:

my $regex = qr/^(\w+(?:\s+\w+)*\R\R#\d+)/m;
...
    my($title, undef, $rank) = split(/\R/, $match);

That is probably the more generic approach, as UTF-8 is the default for File::Slurper::read_text() anyway...

Upvotes: 2

Jan

Reputation: 43169

You were not taking whitespace (as in Data Scientist) into account:

^\w+.*$\R+#(\d+)

See a demo on regex101.com.


\R is roughly equivalent to (?>\r\n|\n|\r|\f|\x0b|\x85|\x{2028}|\x{2029}), i.e. it matches any Unicode newline sequence.
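The pattern above is written for the regex101 demo. A possible way to use the same idea from Perl is sketched below, with two changes that are my assumptions rather than part of the original pattern: the $ is dropped (in a Perl m// or qr//, $\ would be interpolated as a variable, and . stops at a newline anyway), and a capture group is added for the title. The sample deliberately uses \r\n line endings to show what \R buys you.

```perl
use strict;
use warnings;
use feature 'say';

# Shortened sample with Windows-style line endings (an assumption for the demo).
my $content = "Data Scientist\r\n\r\n#1 in Best Paying Jobs\r\n\r\n"
            . "SAP Module Consultant\r\n\r\n#3 in Best Paying Jobs\r\n";

# (\w.*?) is lazy so that the trailing \r is left for \R instead of ending up in the title.
my $regex = qr/^(\w.*?)\R+#(\d+)/m;

my @pairs;
while ($content =~ /$regex/g) {
    push @pairs, "$2: $1";    # rank: title
}

say for @pairs;
```

\R needs Perl 5.10 or later.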

Upvotes: 2
