Reputation: 21
I have strings of the following form (the quotes indicate that each string is on a single line):
"AMINO-2,4,6-TRIIODOBENZOIC ACIDS Hugo Holtermann, Baerum, Leif Gunnar Haugen, Oslo, and Knut Wille, Baerum, Norway, assignors to Nye- 5"
"PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS Duncan Clark and Percy Hayden, Norton-on-Tees, Eng- 5 land, assignors to ImperiaI Chemical Industries Limited, London, England "
I want to get everything after the title (the part that is all caps). So I would like to get:
"Hugo Holtermann, Baerum, Leif Gunnar Haugen, Oslo, and Knut Wille, Baerum, Norway, assignors to Nye- 5"
"Duncan Clark and Percy Hayden, Norton-on-Tees, Eng- 5 land, assignors to ImperiaI Chemical Industries Limited, London, England "
I have many more strings than these two, but the basic format is the same: the title of the invention always consists of capital letters and numbers (plus punctuation such as hyphens and commas, as in the first example).
Is there a way to do this with regular expressions in Perl?
Upvotes: 2
Views: 370
Reputation: 7279
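Titles always end with an uppercase letter followed by a space, so you can anchor on that. Restrict the prefix to title characters, though: with a greedy .+ prefix the match would run on to a later capital-plus-space such as "ImperiaI " in your second string.
if ($string =~ /^[A-Z0-9, -]*[A-Z] (.+)$/) {   # $string holds one input line
    print $1;
}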
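A minimal sketch for checking this kind of pattern against both sample strings from the question. It compares an unrestricted greedy prefix with a prefix limited to title characters; the helper names `after_title_greedy` and `after_title_strict` are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# The two sample strings from the question, unchanged.
my @samples = (
    "AMINO-2,4,6-TRIIODOBENZOIC ACIDS Hugo Holtermann, Baerum, Leif Gunnar Haugen, Oslo, and Knut Wille, Baerum, Norway, assignors to Nye- 5",
    "PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS Duncan Clark and Percy Hayden, Norton-on-Tees, Eng- 5 land, assignors to ImperiaI Chemical Industries Limited, London, England ",
);

# Hypothetical helper: greedy ".+" prefix, which backtracks from the end of
# the string to the LAST uppercase letter that is followed by a space.
sub after_title_greedy {
    my ($s) = @_;
    return $s =~ /^.+[A-Z]+ (.+)$/ ? $1 : undef;
}

# Hypothetical helper: the prefix may only contain title characters
# (uppercase letters, digits, comma, space, dash), so it cannot run
# past the first lowercase letter.
sub after_title_strict {
    my ($s) = @_;
    return $s =~ /^[A-Z0-9, -]*[A-Z] (.+)$/ ? $1 : undef;
}

for my $s (@samples) {
    print "greedy: ", after_title_greedy($s) // '(no match)', "\n";
    print "strict: ", after_title_strict($s) // '(no match)', "\n";
}
```

On the second sample the greedy version starts its capture at "Chemical Industries", because "ImperiaI " ends in a capital I followed by a space; the restricted version starts at "Duncan Clark" as intended.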
Upvotes: 0
Reputation: 91405
How about:
#!/usr/bin/perl
use strict;
use warnings;
use 5.014;
my $re = qr/
    ^                    # start of string
    [\p{Lu}\pN, -]+      # one or more uppercase letters, numbers, commas, spaces or dashes
    (                    # start group 1
        \p{Lu}[\pL.']    # one uppercase letter followed by any letter, dot or apostrophe
    )                    # end group 1
/x;
while(<DATA>) {
chomp;
s/$re/$1/g; # replace match by group 1
say;
}
__DATA__
AMINO-2,4,6-TRIIODOBENZOIC ACIDS Hugo Holtermann, Baerum, Leif Gunnar Haugen, Oslo, and Knut Wille, Baerum, Norway, assignors to Nye- 5
PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS Duncan Clark and Percy Hayden, Norton-on-Tees, Eng- 5 land, assignors to ImperiaI Chemical Industries Limited, London, England
PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS D.Clark
PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS O'Connors
output:
Hugo Holtermann, Baerum, Leif Gunnar Haugen, Oslo, and Knut Wille, Baerum, Norway, assignors to Nye- 5
Duncan Clark and Percy Hayden, Norton-on-Tees, Eng- 5 land, assignors to ImperiaI Chemical Industries Limited, London, England
D.Clark
O'Connors
Upvotes: 0
Reputation: 3967
I tried this and got the output you were expecting ($ip holds the input string):
if($ip =~ m/([A-Z0-9,\- ]+)([A-Z]+[a-z]+.*)/)
{
print "$2";
}
Upvotes: 0
Reputation: 35933
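Well, if it doesn't need to be 100% accurate, I would just look for the first capital letter that is immediately followed by a lowercase letter, and grab the rest of the line from there.
Something like this (my Perl's a little rusty, forgive any syntax errors). Note the parentheses on the left: the match has to be in list context to return the captured text rather than a true/false result.
my ($part_of_line) = $full_line =~ /([A-Z][a-z].*)/;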
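As a rough sketch of this heuristic and of where it stops being accurate, the snippet below wraps the match in a hypothetical helper, `from_first_name`, and runs it on two shortened, illustrative lines. The second line uses the "D.Clark" form, where the initial is followed by a dot rather than a lowercase letter.

```perl
use strict;
use warnings;

# Hypothetical helper: return everything from the first capital letter that
# is immediately followed by a lowercase letter, or undef if there is none.
sub from_first_name {
    my ($full_line) = @_;
    # The parentheses force list context, so the match returns the capture
    # (or an empty list on failure) instead of a true/false value.
    my ($part_of_line) = $full_line =~ /([A-Z][a-z].*)/;
    return $part_of_line;
}

# Shortened, illustrative inputs (not the full strings from the question).
my @lines = (
    "AMINO-2,4,6-TRIIODOBENZOIC ACIDS Hugo Holtermann, Baerum",
    "PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS D.Clark",
);

for my $line (@lines) {
    my $rest = from_first_name($line);
    print $rest // '(no match)', "\n";
}
```

The first line yields "Hugo Holtermann, Baerum". The second yields only "Clark": the leading "D." is lost because "D" is not followed by a lowercase letter, which is the kind of inaccuracy this approach accepts.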
Upvotes: 1
Reputation: 280
Try this:
$text = "PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS Duncan Clark and Percy Hayden, Norton-on-Tees, Eng- 5 land, assignors to ImperiaI Chemical Industries Limited, London, England ";
if($text =~ m/(\b[A-Z0-9, -]+)\b(.*)/) {
print "$2";
}
Upvotes: 0