Reputation: 1232
I'm trying to extract the best paying job titles from this sample text:
Data Scientist
#1 in Best Paying Jobs
5,100 Projected Jobs $250,000 Median Salary 0.5% Unemployment Rate
Programmer
#2 in Best Paying Jobs
4,000 Projected Jobs $240,000 Median Salary 1.0% Unemployment Rate
SAP Module Consultant
#3 in Best Paying Jobs
3,000 Projected Jobs $220,000 Median Salary 0.2% Unemployment Rate
by using the following regex and Perl code.
use File::Glob;
local $/ = undef;
my $file = @ARGV[0];
open INPUT, "<", $file
or die "Couldn't open file $!\n";
my $content = <INPUT>;
my $regex = "^\w+(\w+)*$\n\n#(\d+)";
my @arr_found = ($content =~ m/^\w+(\w+)*$\n\n#(\d+)/g);
close (INPUT);
Q1: The regex finds only the one-word titles*. How to make it find the multiple word titles and how to forward (i.e. how to properly capture) those found titles into the Perl array?
Q2: I defined the regex into a Perl variable and tried to use that variable for the regex operation like:
my @arr_found = ($content =~ m/"$regex"/g);
but it gave error. How to make it?
* When I apply the regex ^\w+(\w+)*$\n\n#(\d+)
on Sublime Text 2, it finds only the one word titles.
Upvotes: 3
Views: 1309
Reputation: 66873
Why not process line-by-line, simple and easy
use warnings;
use strict;
use feature 'say';
my $file = shift || die "Usage: $0 file\n";
open my $fh, '<', $file or die "Can't open $file: $!";
my (@jobs, $prev_line);
while (my $line = <$fh>) {
chomp $line;
next if not $line =~ /\S/;
if ($line =~ /^\s*#[0-9]/) {
push @jobs, $prev_line;
}
$prev_line = $line;
}
say for @jobs;
This relies on the requirement that the #N
line is the first non-empty line after the jobs title.
It prints
Data Scientist Programmer SAP Module Consultant
The question doesn't say whether rankings are wanted as well but there is a hint in the regex that they may be. Then, assuming that the ordering in the file is "correct" you can iterate over the array indices and print elements (titles) with their indices (rank).
Or, to be certain, capture them in the regex, /^\s*#([0-9]+)/
. Then you can directly print both the title and its rank, or perhaps store them in a hash with key-value pairs rank => title
.
As for the regex, there are a few needed corrections. To compose a regex ahead of matching, what is a great idea, you want the qr operator. To work with multi-line strings you need the /m
modifier. (See perlretut.) The regex itself needs fixing. For example
my $regex = qr/^(.+)?(?:\n\s*)+\n\s*#\s*[0-9]/m;
my @titles = $content =~ /$regex/g
what captures a line followed by at least one empty line and then #N
on another line.
If the ranking of titles is needed as well then capture it, too, and store in a hash
my $regex = qr/^(.+)?(?:\n\s*)+\n\s*#\s*([0-9]+)/m;
my %jobs = reverse $content =~ /$regex/g;
or maybe better not push it with reverse
-ing the list of matches but iterate through pairs instead
my %jobs;
while ($content =~ /$regex/g) {
$jobs{$2} = $1;
}
since with this we can check our "catch" at each iteration, do other processing, etc. Then you can sort the keys to print in order
say "#$_ $jobs{$_}" for sort { $a <=> $b } keys %jobs;
and just in general pick jobs by their rank as needed.
I think that it's fair to say that the regex here is much more complex than the first program.
Upvotes: 4
Reputation: 5952
Answers for your questions:
you are capturing the second word only and you do not allow for space in between them. That's why it won't match e.g. Data Scientist
use the qr//
operator to compile regexes with dynamic content. The error stems from the $
in the middle of the regex which Perl regex compiler assumes you got wrong, because $
should only come at the end of a regex.
The following code should achieve what you want. Note the two-step approach:
Find matching text
^
)\w+(?:\s+\w+)*
, no need to capture match)\n\n
)#
followed by a number (\d+
)/g
) and treat strings as multiple lines (/m
, i.e. ^
will match any beginning of a line in the input text)Split match at line ends (\n
) and extract the 1st and the 3rd field
$match
will contain three lines, this approach is much easier than writing another regex.#!/usr/bin/perl
use strict;
use warnings;
use feature qw(say);
use File::Slurper qw(read_text);
my $input = read_text($ARGV[0])
or die "slurp: $!\n";
my $regex = qr/^(\w+(?:\s+\w+)*\n\n#\d+)/m;
foreach my $match ($input =~ /$regex/g) {
#say $match;
my($title, undef, $rank) = split("\n", $match);
$rank =~ s/^#//;
say "MATCH '${title}' '${rank}'";
}
exit 0;
Test run over the example text you provided in your question.
$ perl dummy.pl dummy.txt
MATCH 'Data Scientist' '1'
MATCH 'Programmer' '2'
MATCH 'SAP Module Consultant' '3'
UNICODE UPDATE: as suggested by @Jan's answer the code can be improved like this:
my $regex = qr/^(\w+(?:\s+\w+)*\R\R#\d+)/m;
...
my($title, undef, $rank) = split(/\R/, $match);
That is probably the more generic approach, as UTF-8
is the default for File::Slurper::read_text()
anyway...
Upvotes: 2
Reputation: 43169
You were not taking whitespaces (as in Data Scientist
) into account:
^\w+.*$\R+#(\d+)
\R
is equal to (?>\r\n|\n|\r|\f|\x0b|\x85)
(matches Unicode newlines sequences).
Upvotes: 2