Landon Statis

Reputation: 839

Perl - Extract string from text block

I have some text like this which is stored in a variable:

<ACCEPTANCE-DATETIME>20201014084217
ACCESSION NUMBER:       0001225208-20-012454
CONFORMED SUBMISSION TYPE:  4
PUBLIC DOCUMENT COUNT:      1
CONFORMED PERIOD OF REPORT: 20201012
FILED AS OF DATE:       20201014
DATE AS OF CHANGE:      20201014

What I need is to search the string, and extract this line:

ACCESSION NUMBER: 0001225208-20-012454

And, more specifically, the number: 0001225208-20-012454, without the dashes.

Can't seem to find the right syntax:

my $access_no = $txt =~ /ACCESSION NUMBER/m;

That is not working.

Upvotes: 1

Views: 672

Answers (3)

Polar Bear

Reputation: 6808

There are many ways to extract data of interest and to manipulate them.

Below is the code for a simple approach:

use strict;
use warnings;
use feature 'say';

my $data = do { local $/; <DATA> };

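# Capture in list context; `my $x = ... if ...` has unreliable behavior and is best avoided.
my ($num) = $data =~ /ACCESSION NUMBER:\s+(.*)$/m
    or die "ACCESSION NUMBER not found\n";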

say 'Extracted: ' . $num;
$num =~ s/-//g;
say 'Processed: ' . $num;

__DATA__
<ACCEPTANCE-DATETIME>20201014084217
ACCESSION NUMBER:       0001225208-20-012454
CONFORMED SUBMISSION TYPE:  4
PUBLIC DOCUMENT COUNT:      1
CONFORMED PERIOD OF REPORT: 20201012
FILED AS OF DATE:       20201014
DATE AS OF CHANGE:      20201014

Output

Extracted: 0001225208-20-012454
Processed: 000122520820012454

And now a more complicated approach that extracts the event record into a data structure for further manipulation:

use strict;
use warnings;
use feature 'say';

use Data::Dumper;

my $data = do { local $/; <DATA> };

my %event  = $data =~ /<(.*?)>(.*)$/m;
my %record = $data =~ /(.*?):\s+(.*)$/gm;

$event{record} = \%record;

say '--- Read record '        . '-' x 29;
say $data;
say '--- Content of %record ' . '-' x 22;
say Dumper(\%record);
say '--- Content of %event '  . '-' x 23;
say Dumper(\%event);
say '-' x 45;

my $num0 = $record{'ACCESSION NUMBER'};
my $num1 = $num0;
my $num2 = $num0;
my $num3 = $num0;
my @parts = split '-', $num0;

$num1 =~ s/-//g;
$num2 =~ s/\D//g;
$num3 =~ tr/-//d;

say '$num0 = ' . $num0;
say '$num1 = ' . $num1;
say '$num2 = ' . $num2;
say '$num3 = ' . $num3;
say '@parts = ' . join '', @parts;

__DATA__
<ACCEPTANCE-DATETIME>20201014084217
ACCESSION NUMBER:       0001225208-20-012454
CONFORMED SUBMISSION TYPE:  4
PUBLIC DOCUMENT COUNT:      1
CONFORMED PERIOD OF REPORT: 20201012
FILED AS OF DATE:       20201014
DATE AS OF CHANGE:      20201014

Output

--- Read record -----------------------------
<ACCEPTANCE-DATETIME>20201014084217
ACCESSION NUMBER:       0001225208-20-012454
CONFORMED SUBMISSION TYPE:  4
PUBLIC DOCUMENT COUNT:      1
CONFORMED PERIOD OF REPORT: 20201012
FILED AS OF DATE:       20201014
DATE AS OF CHANGE:      20201014
--- Content of %record ----------------------
$VAR1 = {
          'FILED AS OF DATE' => '20201014',
          'CONFORMED PERIOD OF REPORT' => '20201012',
          'CONFORMED SUBMISSION TYPE' => '4',
          'ACCESSION NUMBER' => '0001225208-20-012454',
          'PUBLIC DOCUMENT COUNT' => '1',
          'DATE AS OF CHANGE' => '20201014'
        };

--- Content of %event -----------------------
$VAR1 = {
          'record' => {
                        'FILED AS OF DATE' => '20201014',
                        'CONFORMED PERIOD OF REPORT' => '20201012',
                        'CONFORMED SUBMISSION TYPE' => '4',
                        'ACCESSION NUMBER' => '0001225208-20-012454',
                        'PUBLIC DOCUMENT COUNT' => '1',
                        'DATE AS OF CHANGE' => '20201014'
                      },
          'ACCEPTANCE-DATETIME' => '20201014084217'
        };

---------------------------------------------
$num0 = 0001225208-20-012454
$num1 = 000122520820012454
$num2 = 000122520820012454
$num3 = 000122520820012454
@parts = 000122520820012454
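
The in-place removals above agree on this input, but they are not interchangeable in general: s/-//g and tr/-//d delete only dashes, while s/\D//g deletes every non-digit. A small sketch of the difference, using a made-up value with trailing text (the "(amended)" suffix is an assumption for illustration, not something from the filing):

```perl
use strict;
use warnings;

# Hypothetical input: the accession number followed by extra text.
my $raw = "0001225208-20-012454 (amended)";

# Copy-then-modify idiom: (my $copy = $orig) =~ <operation>;
(my $dashes_only = $raw) =~ s/-//g;     # deletes only '-'
(my $digits_only = $raw) =~ s/\D//g;    # deletes every non-digit character
(my $tr_dashes   = $raw) =~ tr/-//d;    # deletes only '-', without the regex engine

print "s/-//g   : $dashes_only\n";
print "s/\\D//g  : $digits_only\n";
print "tr/-//d  : $tr_dashes\n";
```

If the captured value might carry stray characters (trailing spaces, a carriage return), the \D variant is the more forgiving choice; if only dashes should ever be touched, tr/-//d states that intent most directly.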

Upvotes: 1

Timur Shtatland

Reputation: 12395

Here is a solution which parses the entire string into an appropriate data structure (hash), and then changes the desired hash element. This method is longer than the one by l4chsalter, but is potentially easier to maintain and extend, in case you need the rest of the fields as well.

#!/usr/bin/env perl

use strict;
use warnings;
use feature qw( say );

my $txt = <<'EOF';
<ACCEPTANCE-DATETIME>20201014084217
ACCESSION NUMBER:       0001225208-20-012454
CONFORMED SUBMISSION TYPE:  4
PUBLIC DOCUMENT COUNT:      1
CONFORMED PERIOD OF REPORT: 20201012
FILED AS OF DATE:       20201014
DATE AS OF CHANGE:      20201014
EOF

# Parse the entire string into the hash with keys/values:
my %val = $txt =~ m{ ^ ( [^:\n]+ ): \s+ ( \S+ .*? ) $ }gxms;

# Print the hash with the parsed string (optional):
# say "'$_' => '$val{$_}'" for keys %val;

# Remove non-digits from the desired element:
$val{'ACCESSION NUMBER'} =~ tr/0-9//cd;

say $val{'ACCESSION NUMBER'};
# 000122520820012454

my %val = $txt =~ m{ ^ ( [^:\n]+ ): \s+ ( \S+ .*? ) $ }gxms;
This captures the patterns in parentheses, returns them as a list, and assigns the list to the hash %val. Its keys are the field names and its values are the corresponding field values.

The regex uses these modifiers: /g to return all matches, /x to ignore whitespace and allow comments inside the regex for readability, /m to make ^ and $ match at the start and end of every line (not only at the start and end of the whole string), and /s to make . match a newline (optional and not needed here, but I like to have it in complex regexes for maintainability).
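
As a quick illustration of the /m modifier mentioned above, here is a minimal sketch on a two-line string (the string and variable names are invented for the example):

```perl
use strict;
use warnings;

my $s = "a: 1\nb: 2\n";

# Without /m, ^ matches only at the very start of the string.
my @without_m = $s =~ /^(\w+):/g;

# With /m, ^ also matches right after every newline.
my @with_m = $s =~ /^(\w+):/gm;

print "without /m: @without_m\n";   # without /m: a
print "with /m:    @with_m\n";      # with /m:    a b
```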

^ ( [^:\n]+ ): matches, starting at the beginning of a line (^), one or more characters that are neither a colon nor a newline, up to the first colon. The parentheses capture the field name.

( \S+ .*? ) $ matches one or more non-whitespace characters followed by as few further characters as possible, up to the first end of line ($). The parentheses capture the field value.
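
Because /x permits comments, the same regex can carry the explanation inline. The following is only a sketch of that layout, run on a shortened copy of the input; the comments must not contain the closing delimiter or variable sigils, since the pattern is still interpolated:

```perl
use strict;
use warnings;

# Shortened copy of the text from the question.
my $txt = <<'EOF';
<ACCEPTANCE-DATETIME>20201014084217
ACCESSION NUMBER:       0001225208-20-012454
CONFORMED SUBMISSION TYPE:  4
EOF

my %val = $txt =~ m{
    ^ ( [^:\n]+ ) :     # field name: everything up to the first colon on the line
    \s+                 # whitespace separating name and value
    ( \S+ .*? )         # field value: begins with non-whitespace characters
    $                   # end of the line, thanks to the m modifier
}gxms;

# The first line has no colon, so it produces no key.
print "$_ => $val{$_}\n" for sort keys %val;
```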

SEE ALSO:
perldoc perlre: Perl regular expressions (regexes)
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
perldoc perlrequick: Perl regular expressions quick start

Upvotes: 1

l4chsalter

Reputation: 41

There are multiple ways of doing it. One way would be:

my $access_no = ''; 
$access_no = $1 . $2 . $3 if $txt =~ m/ACCESSION NUMBER:\s+(\d+)-(\d+)-(\d+)/;
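
For context, the attempt in the question fails because a match evaluated in scalar context returns only a true/false value, not the matched text. A minimal sketch of the difference, assuming $txt holds the block from the question (shortened here, and using a single capture group instead of three):

```perl
use strict;
use warnings;

my $txt = "ACCESSION NUMBER:       0001225208-20-012454\n"
        . "CONFORMED SUBMISSION TYPE:  4\n";

# Scalar context: the match returns 1 (true) or an empty string (false).
my $matched = $txt =~ /ACCESSION NUMBER/;

# List context (note the parentheses around the variable): the match
# returns the captured groups instead.
my ($access_no) = $txt =~ /ACCESSION NUMBER:\s+([\d-]+)/;

# Delete the dashes, but only if the capture succeeded.
$access_no =~ tr/-//d if defined $access_no;

print "matched   = $matched\n";
print "access_no = ", $access_no // 'not found', "\n";
```

The defined-or operator (//) needs Perl 5.10 or later.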

Upvotes: 2
