
Reputation: 21

How to properly parse regular expression in Perl?

I'm currently learning Perl and trying to figure out how to parse a string with regular expressions. I'm trying to extract the username, month, and time (in minutes) from this sample string:

username1 pts/1 75-30-120-13.lig Wed Oct 12 19:20 - 19:25

This is the code I came up with:

# search for the pattern " pts/"
$_ =~ m/ pts\//;
# store the text before this pattern in the variable userName
$userName = "$`";
# Search for the pattern "Word Number Number:Number - Number:Number"
# This pattern is found at the end of each string
$_ =~ m/(\w+) (\d+) (\d+):(\d+) - (\d+):(\d+)/;


# Extract the month, start hour, start minutes, end hour and end minutes
($month, $hours1, $minutes1, $hours2, $minutes2) = ($1, $3, $4, $5, $6);

But I believe it's capturing the whitespace after the username, and I'm not sure what the issue is.

Upvotes: 2

Views: 484

Answers (3)

Andy A.

Reputation: 1452

The most important thing is to know what your input looks like and how it can vary. For example, the hour for 1 am might be written as 1 or 01, and that can make a big difference. There is usually more than one solution.

To parse the given string, use only one regex.

^(\w+) pts.* ([A-Z][a-z]+) (\d+) (\d+):(\d+) - (\d+):(\d+)$

You can try it out on regex101, or on debuggex, which adds a bit of visualization (select PCRE there).

Update: PCRE is not Perl but very close. See comment.

To extract parts of your match, wrap them in parentheses (). Each pair forms a capture group, and after a successful match the groups are available as $1, $2, and so on.

Example:

#!/usr/bin/perl

use strict;
use warnings;

my $str = "username1 pts/1 75-30-120-13.lig Wed Oct 12 19:20 - 19:25";

# $1 .. $7 are only meaningful after a successful match, so check it first
if ($str =~ /^(\w+) pts.* ([A-Z][a-z]+) (\d+) (\d+):(\d+) - (\d+):(\d+)$/) {
    print "$1-$2-$3-$4-$5-$6-$7\n";
}

Of course you can write my $username = $1; ... if you want/need. There are other possibilities too.
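
One of those other possibilities is named captures, available since Perl 5.10. The following is a minimal sketch, not part of the original answer: the helper name parse_line and the capture names (user, month, h1, m1, h2, m2) are illustrative assumptions, and the day of the month is left uncaptured because it is not needed.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# parse_line is a hypothetical helper name. It returns a reference to a
# hash of the named captures, or undef if the line does not match.
sub parse_line {
    my ($line) = @_;
    return unless $line =~ /^(?<user>\w+) pts.* (?<month>[A-Z][a-z]+) \d+ (?<h1>\d+):(?<m1>\d+) - (?<h2>\d+):(?<m2>\d+)$/;
    my %fields = %+;    # copy %+ before another match overwrites it
    return \%fields;
}

my $str = "username1 pts/1 75-30-120-13.lig Wed Oct 12 19:20 - 19:25";

if (my $f = parse_line($str)) {
    print "$f->{user} $f->{month} $f->{h1}:$f->{m1} - $f->{h2}:$f->{m2}\n";
}
```

Named groups avoid having to recount $1, $2, ... whenever a group is added to or removed from the pattern.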

A picture of the regex: regex

(by jex.im. Don't use it for testing, it's JavaScript!)

Update: Changed the first group, the username from (\w*) to (\w+). So the username must have at least one character. (But it can be just _ or a digit!)

Upvotes: 1

Ted Lyngmo

Reputation: 117178

You don't need to do more than one regex matching and you don't need to capture the matches that you don't intend to use.

To capture username, month, start and end hours and minutes, you could do something like this:

my $str = 'username1 pts/1 75-30-120-13.lig Wed Oct 12 19:20 - 19:25';

if( my ($username, $month, $hours1, $minutes1, $hours2, $minutes2) =
    $str =~
    m!^(.+)\spts/\d+\s.*\s(\w+)\s\d{1,2}\s(\d{1,2}):(\d{2})\s-\s(\d{1,2}):(\d{2})$! )
{

    print "$username\n".
          "$month\n".
          "$hours1:$minutes1\n".
          "$hours2:$minutes2\n";
}
  • m! - Match operator using ! instead of / to not have to escape / in the expression.
  • ^(.+) - Match from the beginning of the string, capture one or more characters.
  • \spts/\d+\s - Match whitespace, pts/, one or more digits followed by whitespace.
  • .* - Match zero or more characters
  • \s(\w+) - Whitespace followed by a word. Capture the word (which is the month)
  • \s\d{1,2} - Match the day of the month (1 - 2 digits), no capture
  • \s(\d{1,2}) - Whitespace followed by hour1 (1 - 2 digits). If your hours are always 2 digits, make this \d{2} instead.
  • :(\d{2}) - Colon followed by two digits, minutes1, capture the minutes.
  • \s-\s - Whitespace, hyphen, whitespace
  • Followed by hours2 and minutes2, which look like the first hour/minutes pair.
  • $ - End of string anchor (it also matches just before a trailing newline).
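
The breakdown above can also be kept next to the pattern itself by using the /x modifier, which lets a regex span several lines and carry comments. This is a sketch of the same pattern, not a different technique; the variable name $re is an illustrative choice, and braces are used as delimiters so that / still needs no escaping.

```perl
use strict;
use warnings;

# Under /x, literal whitespace in the pattern is ignored, so \s is
# still needed wherever the input contains whitespace.
my $re = qr{
    ^(.+)                   # username: one or more characters
    \s pts/\d+ \s           # whitespace, pts/, digits, whitespace
    .*                      # zero or more characters
    \s (\w+)                # month (captured)
    \s \d{1,2}              # day of the month, not captured
    \s (\d{1,2}) : (\d{2})  # start hour and minutes
    \s - \s                 # whitespace, hyphen, whitespace
    (\d{1,2}) : (\d{2})     # end hour and minutes
    $                       # end of the string
}x;

my $str = 'username1 pts/1 75-30-120-13.lig Wed Oct 12 19:20 - 19:25';

if (my ($username, $month, $h1, $m1, $h2, $m2) = $str =~ $re) {
    print "$username $month $h1:$m1 - $h2:$m2\n";
}
```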

Upvotes: 2

Polar Bear

Reputation: 6798

There is more than one approach to solving the problem.

OP's question mentions regular expressions, so let's look at one variation of such a regular expression.

We know that:

  • the first element in the line is the username, consisting of letters and digits, so we can apply \w+
  • then, after a space, we have pts/1, which \S+ matches
  • next comes what is perhaps the name of a program or file, consisting of digits, dashes, dots and letters; we can again apply \S+
  • next is the day of the week; \S+ fits again nicely, and \w+ also works well
  • next is the month; we can use \S+ or \w+
  • next is the day of the month; again we can use \S+ or \d+
  • now we have a time, which consists of the hour \d{1,2} followed by : and the minutes \d{2} (perhaps the start time)
  • next is the separator -
  • then perhaps the stop time, made of the same parts as the start time, so the same pattern applies
  • the parts we would like to capture should be enclosed in ()

Description:

  • \d any digit
  • \d{min,max} between min and max digits in a row
  • \S any character except whitespace
  • \S+ one or more non-whitespace characters
  • \w a word character (letter, digit or underscore)

A construct of the following form can be used to declare a list of variables and assign the captured values to them in one step:

my($var1,$var2,$var3,...,$var#) = $data =~ /$re/;

In this particular case the split function can also extract the information of interest (first case: hour and minute as one block; second case: hour and minute as separate values). It is included in the code to demonstrate that more than one solution is possible.

use strict;
use warnings;
use feature 'say';

my $data = 'username1 pts/1 75-30-120-13.lig Wed Oct 12 19:20 - 19:25';

my $re = qr/(\w+) \S+ \S+ \S+ (\S+) \S+ (\d{1,2}):(\d{2}) - (\d{1,2}):(\d{2})/;

my($username,$month,$h1,$m1,$h2,$m2) = $data =~ /$re/;

say "$username $month, $h1:$m1 - $h2:$m2";

my($username1,$pts,$prog,$wday,$month1,$day,$start,$div,$end) = split(' ',$data);

say "$username1 $month1 $day, $start - $end";

my($username2,$month2,$day2,$h2_1,$m2_1,$h2_2,$m2_2) = (split(/[: ]/,$data))[0,4,5,6,7,9,10];

say "$username2 $month2 $day2, $h2_1:$m2_1 - $h2_2:$m2_2";

Output

username1 Oct, 19:20 - 19:25
username1 Oct 12, 19:20 - 19:25
username1 Oct 12, 19:20 - 19:25
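
The question also asks for the time in minutes and mentions stray whitespace after the username. The sketch below addresses both under stated assumptions: the padded sample line is hypothetical (it assumes the real input aligns its columns with runs of spaces), duration_minutes is an illustrative helper name, and a session is assumed to last less than 24 hours.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical input with padded columns (an assumption about the real data).
my $data = 'username1   pts/1    75-30-120-13.lig Wed Oct 12 19:20 - 19:25';

# \s+ between the fields tolerates any amount of whitespace.
my $re = qr/^(\w+)\s+\S+\s+\S+\s+\S+\s+(\S+)\s+\S+\s+(\d{1,2}):(\d{2})\s+-\s+(\d{1,2}):(\d{2})/;

# Minutes between start and end; assumes the session is shorter than 24 hours.
sub duration_minutes {
    my ($h1, $m1, $h2, $m2) = @_;
    my $minutes = ($h2 * 60 + $m2) - ($h1 * 60 + $m1);
    $minutes += 24 * 60 if $minutes < 0;    # the end time is on the next day
    return $minutes;
}

if (my ($username, $month, $h1, $m1, $h2, $m2) = $data =~ $re) {
    say "$username $month ", duration_minutes($h1, $m1, $h2, $m2), " min";
}
```

The split(' ', $data) variant shown earlier already treats a run of whitespace as a single separator, so it handles padded columns without changes.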

References: split, qr//

Upvotes: 1
