Minh Phan

Reputation: 21

Parsing a particular type of string in Perl

I have strings of the following type (the quotes indicate that each one is on a single line):

"AMINO-2,4,6-TRIIODOBENZOIC ACIDS Hugo Holtermann, Baerum, Leif Gunnar Haugen, Oslo, and Knut Wille, Baerum, Norway, assignors to Nye- 5"

"PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS Duncan Clark and Percy Hayden, Norton-on-Tees, Eng- 5 land, assignors to ImperiaI Chemical Industries Limited, London, England "

I want to get everything after the title (the part that is in all caps). So I would like to get:

"Hugo Holtermann, Baerum, Leif Gunnar Haugen, Oslo, and Knut Wille, Baerum, Norway, assignors to Nye- 5"

"Duncan Clark and Percy Hayden, Norton-on-Tees, Eng- 5 land, assignors to ImperiaI Chemical Industries Limited, London, England "

I have many more strings than these two, but the basic format is always the same: the title of the invention consists of capital letters and numbers.

Is there a way to do this with regular expressions in Perl?

Upvotes: 2

Views: 370

Answers (5)

Dimanoid

Reputation: 7279

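The title always ends with a run of capital letters followed by a space, so this should work:

# Assumes the string is in $_. Requiring at least two capitals keeps a lone
# capital such as the final "I" in "ImperiaI " from being taken as the end of the title.
print $1 if /^.+[A-Z]{2,} (.+)$/;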
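
A quick way to test a pattern of this shape is to run it over both sample strings from the question. The sketch below does that with a hypothetical helper, `rest_after_caps` (not part of the original answer), which takes the minimum length of the closing run of capitals as a parameter. With a minimum of 1, the greedy `.+` can run past the title to a later capital followed by a space, such as the final "I" of "ImperiaI "; a minimum of 2 avoids that for these samples, on the assumption that the last word of the title has at least two letters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample strings copied from the question.
my @samples = (
    "AMINO-2,4,6-TRIIODOBENZOIC ACIDS Hugo Holtermann, Baerum, Leif Gunnar Haugen, Oslo, and Knut Wille, Baerum, Norway, assignors to Nye- 5",
    "PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS Duncan Clark and Percy Hayden, Norton-on-Tees, Eng- 5 land, assignors to ImperiaI Chemical Industries Limited, London, England ",
);

# Hypothetical helper: return the text after the LAST run of at least
# $min capital letters that is followed by a space, or undef if none.
sub rest_after_caps {
    my ($line, $min) = @_;
    return $line =~ /^.+[A-Z]{$min,} (.+)$/ ? $1 : undef;
}

for my $line (@samples) {
    for my $min (1, 2) {
        my $rest = rest_after_caps($line, $min);
        printf "min=%d: %s\n", $min, defined $rest ? $rest : '(no match)';
    }
}
```

For the second sample, the `min=1` line starts at "Chemical Industries", while the `min=2` line starts at "Duncan Clark".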

Upvotes: 0

Toto

Reputation: 91405

How about:

#!/usr/bin/perl
use strict;
use warnings;
use 5.014;

my $re = qr
    /^                # start of string
    [\p{Lu}\pN, -]+   # one or more uppercase letters, numbers, commas, spaces, or dashes
    (                 # start of group 1
      \p{Lu}[\pL.']   # one uppercase letter followed by any letter, dot, or apostrophe
    )                 # end of group 1
    /x;
while (<DATA>) {
    chomp;
    s/$re/$1/;        # replace the match with group 1 (the pattern is anchored, so /g is not needed)
    say;
}
}


__DATA__
AMINO-2,4,6-TRIIODOBENZOIC ACIDS Hugo Holtermann, Baerum, Leif Gunnar Haugen, Oslo, and Knut Wille, Baerum, Norway, assignors to Nye- 5
PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS Duncan Clark and Percy Hayden, Norton-on-Tees, Eng- 5 land, assignors to ImperiaI Chemical Industries Limited, London, England
PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS D.Clark
PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS O'Connors

output:

Hugo Holtermann, Baerum, Leif Gunnar Haugen, Oslo, and Knut Wille, Baerum, Norway, assignors to Nye- 5
Duncan Clark and Percy Hayden, Norton-on-Tees, Eng- 5 land, assignors to ImperiaI Chemical Industries Limited, London, England
D.Clark
O'Connors

Upvotes: 0

Raghuram

Reputation: 3967

I tried this and got the output you were expecting:

# $ip holds the input line; ^ anchors the title to the start of the string
if ($ip =~ m/^([A-Z0-9,\- ]+)([A-Z]+[a-z]+.*)/)
{
    print $2;
}

Upvotes: 0

Tim

Reputation: 35933

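Well, if it doesn't need to be 100% accurate, I would just look for the first capital letter that is immediately followed by a lowercase letter, and grab the rest of the line from there.

Something like this (my Perl's a little rusty, forgive any syntax errors):

# The parentheses around $part_of_line force list context, so it receives the captured text rather than a true/false result.
my ($part_of_line) = $full_line =~ /([A-Z][a-z].*)/;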
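
As a minimal sketch of this heuristic, the snippet below wraps the match in a hypothetical helper, `after_title` (the name is an assumption, not from the answer), and contrasts list context with scalar context. The distinction matters: assigning a match to a plain scalar stores whether it succeeded, not what it captured.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return everything from the first capital letter that
# is immediately followed by a lowercase letter, or undef if there is none.
sub after_title {
    my ($full_line) = @_;
    # List context: the match returns its captures, so $part_of_line gets $1.
    my ($part_of_line) = $full_line =~ /([A-Z][a-z].*)/;
    return $part_of_line;
}

my $line = "PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS Duncan Clark and Percy Hayden";

print after_title($line), "\n";    # prints "Duncan Clark and Percy Hayden"

# Scalar context: the same match only reports success (1), not the capture.
my $matched = $line =~ /([A-Z][a-z].*)/;
print "$matched\n";                # prints "1"
```

The heuristic will cut too early if a title ever contains a mixed-case word, which is the "not 100% accurate" trade-off mentioned above.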

Upvotes: 1

Ceramic Pot

Reputation: 280

Try this:

$text = "PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS Duncan Clark and Percy Hayden, Norton-on-Tees, Eng- 5 land, assignors to ImperiaI Chemical Industries Limited, London, England ";

# The hyphen is placed last in the character class so that it is always read as a literal.
if ($text =~ m/(\b[A-Z0-9, -]+)\b(.*)/) {
    print $2;
}
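
To see what the word-boundary assertion contributes, the sketch below wraps the pattern in a hypothetical helper, `split_title` (not part of the answer), and prints both captured groups. The greedy character class first swallows the capital that begins the name; `\b` then forces it to give that letter back, because there is no word boundary between "D" and "u". One limitation this makes visible: with an initial followed by a dot, as in "D.Clark", there is a boundary between "D" and ".", so the initial stays with the title.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return (title, rest), or an empty list if nothing matches.
# The hyphen sits last in the class so it is read as a literal character.
sub split_title {
    my ($text) = @_;
    return $text =~ /(\b[A-Z0-9, -]+)\b(.*)/ ? ($1, $2) : ();
}

for my $text (
    "PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS Duncan Clark",
    "PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS D.Clark",
) {
    my ($title, $rest) = split_title($text);
    print "title=[$title] rest=[$rest]\n";
}
```

The first line of output ends with `rest=[Duncan Clark]`, the second with `rest=[.Clark]`.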

Upvotes: 0
