Reputation: 21
I'm currently learning Perl and trying to figure out how to parse a string using regular expressions. I'm trying to extract the username, month, and time (in minutes) from this sample string:
username1 pts/1 75-30-120-13.lig Wed Oct 12 19:20 - 19:25
This is the code I came up with:
# search for the pattern " pts/"
$_ =~ m/ pts\//;
# store the text before this pattern in the variable userId
$userName = "$`";
# Search for the pattern "Word Number Number:Number - Number:Number"
# This pattern is found at the end of each string
$_ =~ m/(\w+) (\d+) (\d+):(\d+) - (\d+):(\d+)/;
# Extract the month start hour, start minutes, end hour and end minutes
($month, $hours1, $minutes1, $hours2, $minutes2) = ($1, $3, $4, $5, $6);
But I believe it's getting the whitespaces after the username, and I'm not sure what the issue is.
Upvotes: 2
Views: 484
Reputation: 1452
The most important thing is to know what your source looks like and what it can look like. For example, the hour 1 am can be written as 1 or 01. That can make a big difference! There is usually more than one solution.
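To illustrate the 1 versus 01 point, here is a minimal sketch comparing a strict two-digit hour pattern with a lenient one. The helper name time_matches and the sample times are made up for illustration and are not taken from the question's data.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $time matches the given pattern, else 0.
sub time_matches {
    my ($time, $re) = @_;
    return $time =~ $re ? 1 : 0;
}

my $strict  = qr/^\d{2}:\d{2}$/;    # hour must be exactly two digits
my $lenient = qr/^\d{1,2}:\d{2}$/;  # hour may be one or two digits

# Made-up sample times, only to show the difference between the patterns.
for my $time ('1:05', '01:05', '19:20') {
    printf "%-5s strict=%d lenient=%d\n",
        $time, time_matches($time, $strict), time_matches($time, $lenient);
}
```

Only the one-digit form 1:05 is rejected by the strict pattern, so which pattern is right depends on what the source can actually contain.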
To parse the given string, use only one regex.
^(\w+) pts.* ([A-Z][a-z]+) (\d+) (\d+):(\d+) - (\d+):(\d+)$
Test it on regex101, where you can try it out, or on debuggex, which adds a bit of visualization (use PCRE there).
Update: PCRE is not Perl but very close. See comment.
To get parts out of your match, use (). Each pair of parentheses builds a group, and you can read these groups back as $1 ... $9 ... if they were set by your regex.
Example:
#!/usr/bin/perl
use strict;
use warnings;
my $str = "username1 pts/1 75-30-120-13.lig Wed Oct 12 19:20 - 19:25";
if ($str =~ /^(\w+) pts.* ([A-Z][a-z]+) (\d+) (\d+):(\d+) - (\d+):(\d+)$/) {
    # Only use $1 ... $7 after a successful match
    print "$1-$2-$3-$4-$5-$6-$7\n";
}
Of course you can write my $username = $1; ... if you want or need to. There are other possibilities too.
(Visualization by jex.im. Don't use it for testing, it is JavaScript!)
Update: Changed the first group, the username, from (\w*) to (\w+), so the username must have at least one character. (But it can be just _ or a digit!)
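One of the other possibilities is named capture groups, available since Perl 5.10. The following is only a sketch: the group names, the hash reference it returns, and the helper name parse_login are chosen here for illustration. Because the /x flag ignores literal whitespace in the pattern, the single spaces are written as \s, which is slightly more permissive than the original regex.

```perl
use strict;
use warnings;

# Same structure as the regex above, written with /x and named groups.
# The group names are illustrative assumptions, not part of the original answer.
my $re = qr{
    ^(?<user>\w+) \s pts.* \s
    (?<month>[A-Z][a-z]+) \s (?<day>\d+) \s
    (?<h1>\d+):(?<m1>\d+) \s - \s (?<h2>\d+):(?<m2>\d+)$
}x;

# Hypothetical helper: returns a hash reference of the named captures,
# or undef if the line does not match.
sub parse_login {
    my ($line) = @_;
    return $line =~ $re ? +{ %+ } : undef;
}

my $str   = 'username1 pts/1 75-30-120-13.lig Wed Oct 12 19:20 - 19:25';
my $login = parse_login($str);

if ($login) {
    print "$login->{user} $login->{month} $login->{day} ",
          "$login->{h1}:$login->{m1} - $login->{h2}:$login->{m2}\n";
}
else {
    warn "Line did not match: $str\n";
}
```

Named groups avoid having to count parentheses when the pattern changes, at the cost of a somewhat longer regex.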
Upvotes: 1
Reputation: 117178
You don't need to do more than one regex matching and you don't need to capture the matches that you don't intend to use.
To capture username, month, start and end hours and minutes, you could do something like this:
my $str = 'username1 pts/1 75-30-120-13.lig Wed Oct 12 19:20 - 19:25';
if( my ($username, $month, $hours1, $minutes1, $hours2, $minutes2) =
    $str =~
    m!^(.+)\spts/\d+\s.*\s(\w+)\s\d{1,2}\s(\d{1,2}):(\d{2})\s-\s(\d{1,2}):(\d{2})$! )
{
    print "$username\n".
          "$month\n".
          "$hours1:$minutes1\n".
          "$hours2:$minutes2\n";
}
m! - Match operator using ! instead of / to not have to escape / in the expression.
^(.+) - Match from the beginning of the string, capture one or more characters.
\spts/\d+\s - Match whitespace, pts/, one or more digits followed by whitespace.
.* - Match zero or more characters.
\s(\w+) - Whitespace followed by a word. Capture the word (which is the month).
\s\d{1,2} - Match the day of the month (1 - 2 digits), no capture.
\s(\d{1,2}) - Whitespace followed by hour1 (1 - 2 digits). If your hours are always 2 digits, make this \d{2} instead.
:(\d{2}) - Colon followed by two digits, minutes1, capture the minutes.
\s-\s - Whitespace, hyphen, whitespace.
(\d{1,2}):(\d{2}) - hours2 and minutes2, which look like the first hour/minutes pair.
$ - End of line anchor.
Upvotes: 2
Reputation: 6798
There is more than one approach to solving the problem.
The OP's question mentions regular expressions, so let's look at one variation of such a regular expression.
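We know that:
- the username consists of word characters: \w+
- for the terminal field pts/1 we can apply \S+
- the host field (75-30-120-13.lig) also fits \S+
- for the weekday \S+ will fit again nicely, or \w+ also works well
- the month matches \S+ or \w+
- the day of the month matches \S+ or \d+
- the hour is \d{1,2} followed by : and minutes \d{2} (perhaps start time)
- then comes - followed by a second hour:minutes pair
- the parts of interest are captured with ()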
Description:
\d any digit
\d{min,max} between min and max consecutive digits
\S anything but whitespace
\S+ one or more non-whitespace characters
\w word character (letters, digits, underscore)

A construct of the following form can be used to declare a list of variables and assign the captured values to them:
my($var1,$var2,$var3,...,$var#) = $data =~ /$re/;
In this particular case the split function can also help extract the information of interest (first case: hour and minute as one block; second case: hour and minute as separate entities). It is included in the code to demonstrate that more than one solution is possible.
use strict;
use warnings;
use feature 'say';
my $data = 'username1 pts/1 75-30-120-13.lig Wed Oct 12 19:20 - 19:25';
my $re = qr/(\w+) \S+ \S+ \S+ (\S+) \S+ (\d{1,2}):(\d{2}) - (\d{1,2}):(\d{2})/;
my($username,$month,$h1,$m1,$h2,$m2) = $data =~ /$re/;
say "$username $month, $h1:$m1 - $h2:$m2";
my($username1,$pts,$prog,$wday,$month1,$day,$start,$div,$end) = split(' ',$data);
say "$username1 $month1 $day, $start - $end";
my($username2,$month2,$day2,$h2_1,$m2_1,$h2_2,$m2_2) = (split(/[: ]/,$data))[0,4,5,6,7,9,10];
say "$username2 $month2 $day2, $h2_1:$m2_1 - $h2_2:$m2_2";
Output
username1 Oct, 19:20 - 19:25
username1 Oct 12, 19:20 - 19:25
username1 Oct 12, 19:20 - 19:25
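The question also asks for the time in minutes, which the code above does not compute. A minimal sketch of that arithmetic follows. The helper name session_minutes is made up here, and the midnight handling is an assumption the question does not confirm: a session whose end time is earlier than its start time is treated as having crossed midnight exactly once.

```perl
use strict;
use warnings;

# Hypothetical helper: elapsed minutes between a start and an end clock time.
# Assumption: if the end is earlier than the start, the session crossed
# midnight exactly once, so one day (24 * 60 minutes) is added.
sub session_minutes {
    my ($h1, $m1, $h2, $m2) = @_;
    my $minutes = ($h2 * 60 + $m2) - ($h1 * 60 + $m1);
    $minutes += 24 * 60 if $minutes < 0;
    return $minutes;
}

my $data = 'username1 pts/1 75-30-120-13.lig Wed Oct 12 19:20 - 19:25';

# Capture only the two hour:minutes pairs at the end of the line.
if (my ($h1, $m1, $h2, $m2) = $data =~ /(\d{1,2}):(\d{2}) - (\d{1,2}):(\d{2})$/) {
    my $minutes = session_minutes($h1, $m1, $h2, $m2);
    print "$minutes minutes\n";
}
```

For the sample line, the session from 19:20 to 19:25 comes out as 5 minutes.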
Upvotes: 1