Reputation: 839
I have some text like this which is stored in a variable:
<ACCEPTANCE-DATETIME>20201014084217
ACCESSION NUMBER: 0001225208-20-012454
CONFORMED SUBMISSION TYPE: 4
PUBLIC DOCUMENT COUNT: 1
CONFORMED PERIOD OF REPORT: 20201012
FILED AS OF DATE: 20201014
DATE AS OF CHANGE: 20201014
What I need is to search the string, and extract this line:
ACCESSION NUMBER: 0001225208-20-012454
And, more specifically, the number: 0001225208-20-012454, without the dashes.
Can't seem to find the right syntax:
my $access_no = $txt =~ /ACCESSION NUMBER/m;
That is not working.
Upvotes: 1
Views: 672
Reputation: 6808
There are many ways to extract the data of interest and to manipulate it.
Below is the code for a simple approach:
use strict;
use warnings;
use feature 'say';
my $data = do { local $/; <DATA> };
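# A list-context match returns the capture directly; this avoids
# "my ... if ...", whose behavior Perl documents as undefined.
my ($num) = $data =~ /ACCESSION NUMBER:\s+(.*)$/m;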
say 'Extracted: ' . $num;
$num =~ s/-//g;
say 'Processed: ' . $num;
__DATA__
<ACCEPTANCE-DATETIME>20201014084217
ACCESSION NUMBER: 0001225208-20-012454
CONFORMED SUBMISSION TYPE: 4
PUBLIC DOCUMENT COUNT: 1
CONFORMED PERIOD OF REPORT: 20201012
FILED AS OF DATE: 20201014
DATE AS OF CHANGE: 20201014
Output
Extracted: 0001225208-20-012454
Processed: 000122520820012454
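If the field might be absent, the result of the match can be checked before it is used. A minimal sketch, assuming the text is already in a scalar; the helper name extract_accession and the shortened sample string are hypothetical and not part of the code above:

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: returns the accession number without dashes,
# or undef when the field is not present in the text.
sub extract_accession {
    my ($text) = @_;

    # In list context the match returns the captured group,
    # or an empty list (so $num stays undef) when nothing matches.
    my ($num) = $text =~ /^ACCESSION NUMBER:\s+(\S+)/m;
    return unless defined $num;

    $num =~ tr/-//d;    # delete the dashes only
    return $num;
}

my $txt = "ACCESSION NUMBER: 0001225208-20-012454\nFILED AS OF DATE: 20201014\n";

say extract_accession($txt) // 'not found';
say extract_accession("FILED AS OF DATE: 20201014\n") // 'not found';
```

The first call prints the number without dashes; the second prints "not found" instead of raising an "uninitialized value" warning.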
And now a more complicated approach, which extracts the event record into a data structure for further manipulation:
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my $data = do { local $/; <DATA> };
my %event = $data =~ /<(.*?)>(.*)$/m;
my %record = $data =~ /(.*?):\s+(.*)$/gm;
$event{record} = \%record;
say '--- Read record ' . '-' x 29;
say $data;
say '--- Content of %record ' . '-' x 22;
say Dumper(\%record);
say '--- Content of %event ' . '-' x 23;
say Dumper(\%event);
say '-' x 45;
my $num0 = $record{'ACCESSION NUMBER'};
my $num1 = $num0;
my $num2 = $num0;
my $num3 = $num0;
my @parts = split '-', $num0;
$num1 =~ s/-//g;
$num2 =~ s/\D//g;
$num3 =~ tr/-//d;
say '$num0 = ' . $num0;
say '$num1 = ' . $num1;
say '$num2 = ' . $num2;
say '$num3 = ' . $num3;
say '@parts = ' . join '', @parts;
__DATA__
<ACCEPTANCE-DATETIME>20201014084217
ACCESSION NUMBER: 0001225208-20-012454
CONFORMED SUBMISSION TYPE: 4
PUBLIC DOCUMENT COUNT: 1
CONFORMED PERIOD OF REPORT: 20201012
FILED AS OF DATE: 20201014
DATE AS OF CHANGE: 20201014
Output
--- Read record -----------------------------
<ACCEPTANCE-DATETIME>20201014084217
ACCESSION NUMBER: 0001225208-20-012454
CONFORMED SUBMISSION TYPE: 4
PUBLIC DOCUMENT COUNT: 1
CONFORMED PERIOD OF REPORT: 20201012
FILED AS OF DATE: 20201014
DATE AS OF CHANGE: 20201014
--- Content of %record ----------------------
$VAR1 = {
'FILED AS OF DATE' => '20201014',
'CONFORMED PERIOD OF REPORT' => '20201012',
'CONFORMED SUBMISSION TYPE' => '4',
'ACCESSION NUMBER' => '0001225208-20-012454',
'PUBLIC DOCUMENT COUNT' => '1',
'DATE AS OF CHANGE' => '20201014'
};
--- Content of %event -----------------------
$VAR1 = {
'record' => {
'FILED AS OF DATE' => '20201014',
'CONFORMED PERIOD OF REPORT' => '20201012',
'CONFORMED SUBMISSION TYPE' => '4',
'ACCESSION NUMBER' => '0001225208-20-012454',
'PUBLIC DOCUMENT COUNT' => '1',
'DATE AS OF CHANGE' => '20201014'
},
'ACCEPTANCE-DATETIME' => '20201014084217'
};
---------------------------------------------
$num0 = 0001225208-20-012454
$num1 = 000122520820012454
$num2 = 000122520820012454
$num3 = 000122520820012454
@parts = 000122520820012454
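The three substitutions give the same result here only because the value contains nothing but digits and dashes. A small sketch with a made-up value (the trailing " (amended)" is a hypothetical suffix, not something taken from the filing) shows where they differ:

```perl
use strict;
use warnings;
use feature 'say';

# Made-up value for illustration: the real field holds only digits and dashes.
my $raw = '0001225208-20-012454 (amended)';

(my $dash_subst = $raw) =~ s/-//g;     # removes every dash, keeps everything else
(my $non_digit  = $raw) =~ s/\D//g;    # removes every non-digit character
(my $dash_tr    = $raw) =~ tr/-//d;    # same result as s/-//g, without a regex

say $dash_subst;    # 000122520820012454 (amended)
say $non_digit;     # 000122520820012454
say $dash_tr;       # 000122520820012454 (amended)
```

Which one fits depends on whether stray non-digit characters should be kept or stripped along with the dashes.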
Upvotes: 1
Reputation: 12395
Here is a solution which parses the entire string into an appropriate data structure (hash), and then changes the desired hash element. This method is longer than the one by l4chsalter, but is potentially easier to maintain and extend, in case you need the rest of the fields as well.
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw( say );
my $txt = <<'EOF';
<ACCEPTANCE-DATETIME>20201014084217
ACCESSION NUMBER: 0001225208-20-012454
CONFORMED SUBMISSION TYPE: 4
PUBLIC DOCUMENT COUNT: 1
CONFORMED PERIOD OF REPORT: 20201012
FILED AS OF DATE: 20201014
DATE AS OF CHANGE: 20201014
EOF
# Parse the entire string into the hash with keys/values:
my %val = $txt =~ m{ ^ ( [^:\n]+ ): \s+ ( \S+ .*? ) $ }gxms;
# Print the hash with the parsed string (optional):
# say "'$_' => '$val{$_}'" for keys %val;
# Remove non-digits from the desired element:
$val{'ACCESSION NUMBER'} =~ tr/0-9//cd;
say $val{'ACCESSION NUMBER'};
# 000122520820012454
my %val = $txt =~ m{ ^ ( [^:\n]+ ): \s+ ( \S+ .*? ) $ }gxms;
Captures the patterns in parentheses, returns them as a list, and assigns the list to the hash %val. Its keys are the field names, and its values are the corresponding field values.
The regex uses these modifiers: /g to return multiple matches, /x to disregard whitespace and comments inside the regex for readability, /m to make ^ and $ match at the start and end of every line, and /s to make . match a newline (optional, not needed here, but I like to have it in complex regexes for maintainability).
^ ( [^:\n]+ ):
Matches, starting at the beginning of the line (^), one or more characters that are not a colon or newline, up to the first colon. The parentheses thus capture the field name.
( \S+ .*? ) $
Matches one or more non-whitespace characters followed by any characters, as few as possible, up to the first end of the line ($). The parentheses thus capture the field value.
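Because /x is in effect, the same pattern can be spread over several lines with one comment per part. A sketch of that layout, using a shortened two-line sample rather than the full text; the sample and the sorted printout are only illustrative:

```perl
use strict;
use warnings;
use feature 'say';

my $txt = <<'EOF';
ACCESSION NUMBER: 0001225208-20-012454
FILED AS OF DATE: 20201014
EOF

my %val = $txt =~ m{
    ^               # start of a line (because of /m)
    ( [^:\n]+ )     # field name: anything except a colon or newline
    :               # the colon that ends the field name
    \s+             # whitespace between name and value
    ( \S+ .*? )     # field value: starts with a non-whitespace character
    $               # end of the line (because of /m)
}gxms;

say "$_ => $val{$_}" for sort keys %val;
```

One caveat with this layout: variables are still interpolated inside /x comments, so the comments should avoid the $ and @ characters.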
SEE ALSO:
perldoc perlre: Perl regular expressions (regexes), in particular Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
perldoc perlrequick: Perl regular expressions quick start
Upvotes: 1
Reputation: 41
There are multiple ways of doing it. One way would be:
my $access_no = '';
$access_no = $1 . $2 . $3 if $txt =~ m/ACCESSION NUMBER:\s+(\d+)-(\d+)-(\d+)/;
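The same idea can also be written with the match in list context, which avoids spelling out $1, $2, and $3 by hand. A sketch, with a shortened sample string standing in for $txt:

```perl
use strict;
use warnings;
use feature 'say';

# Shortened stand-in for the real text.
my $txt = "ACCESSION NUMBER: 0001225208-20-012454\n";

# In list context the match returns the three captured digit groups,
# and join glues them together. On a failed match the list is empty,
# so $access_no ends up as the empty string.
my $access_no = join '', $txt =~ /ACCESSION NUMBER:\s+(\d+)-(\d+)-(\d+)/;

say $access_no;
```

As in the one-liner above, a failed match leaves $access_no as an empty string, so the caller can test it with a plain truth check.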
Upvotes: 2