NilsonCain
NilsonCain

Reputation: 111

Perl: Search & Replace within a foreach loop

perhaps someone can help me out. I need to do a search and replace on a given string, finding any occurance of one of a list of things, and inserting a carriage return before it.

I'm providing a sample string, and my attempt at solving the problem.

Sample Input:

MSH|^~\&|PCM|A|NSG|A|20120613081122|DoNotBundle|ORM^O01|1133316|P|2.2|||AL|NEPID|1|1234567^PI^PE|345235^ST02A^MR^A~02340395^ST02^PI||HSM^AERHART||19510418000000|F||||||||||1215200001^A|111-22-3333
PV1|1|I|CCU^W207^A^A||||12342^ALI^ROGERS^M^MD^MD|||SUR|||||||16532^ALI^ROGERS^M^MD^MD|INP||B|||||||||||||||||||A|||||20120531145230ORC|PA|11109489^PCM|11109489^PCM|94986|SC||1^Continuous^INDEF^20120613081900^1||20120613081958|RGYIDDER^YIDDER^ROBERT^GSYSTEM ADM^SA||16532^ALI^ROGERS^MMD^MD|CCU||20120613081958|||CCU|RGYIDDER^YIDDER^ROBERT^
G^SYSTEM ADM^SA
OBR|1|11109489^PCM|11109489^PCM|DNR ON^Hard of Hearing^NSG||20120613081122||||||||||16532^ALI^ROGERS^M^MD^MD|||||||||||1^Continuous^INDEF^20120613081900^1

And my attempt:

$/ = undef;         #tells perl to ignore newlines when reading input
$input = <STDIN>;   #read entire input into $input

$input =~ s/\R/ /g;  #remove all newlines from input. \R matches \r, \n, \r\n

@validSegHdrs = (   "ABS", "ACC", "ADD", "ADJ", "AFF", "AIG", "AIL", "AIP", "AIS", "AL1",
                    "APR", "ARQ", "ACC", "ADD", "ADJ", "AFF", "AIG", "AIL", "AIP", "AIS",
                    "AL1", "APR", "ARQ", "ARV", "AUT", "BHS", "BLC", "BLG", "BPO", "BPX",
                    "BTS", "BTX", "CDM", "CER", "CM0", "CM1", "CM2", "CNS", "CON", "CSP",
                    "CSR", "CSS", "CTD", "CTI", "DB1", "DG1", "DMI", "DRG", "DSC", "DSP",
                    "ECD", "ECR", "EDU", "EQP", "EQU", "ERR", "EVN", "FAC", "FHS", "FT1",
                    "FTS", "GOL", "GP1", "GP2", "GT1", "IAM", "IIM", "ILT", "IN1", "IN2",
                    "IN3", "INV", "IPC", "IPR", "ISD", "ITM", "IVC", "IVT", "LAN", "LCC",
                    "LCH", "LDP", "LOC", "LRL", "MFA", "MFE", "MFI", "MRG", "MSA", "MSH",
                    "NCK", "NDS", "NK1", "NPU", "NSC", "NST", "NTE", "OBR", "OBX", "ODS",
                    "ODT", "OM1", "OM2", "OM3", "OM4", "OM5", "OM6", "OM7", "ORC", "ORG",
                    "OVR", "PCE", "PCR", "PD1", "PDA", "PDC", "PEO", "PES", "PID", "PKG",
                    "PMT", "PR1", "PRA", "PRB", "PRC", "PRD", "PSG", "PSH", "PSL", "PSS",
                    "PTH", "PV1", "PV2", "PYE", "QAK", "QID", "QPD", "QRD", "QRF", "QRI",
                    "RCP", "RDF", "RDT", "REL", "RF1", "RFI", "RGS", "RMI", "ROL", "RQ1",
                    "RQD", "RXA", "RXC", "RXD", "RXE", "RXG", "RXO", "RXR", "SAC", "SCD",
                    "SCH", "SCP", "SDD", "SFT", "SID", "SLT", "SPM", "STF", "STZ", "TCC",
                    "TCD", "TQ1", "TQ2", "TXA", "UAC", "UB1", "UB2", "URD", "URS", "VAR",
                    "VND"
);

foreach (@validSegHdrs) {
    $input =~ s/$_/\r$_/g;
}

print $input; 

-

For what it's worth, I'm working with HL7. HL7 consists of "segments" each on its own line. The segment beginning with "MSH" is always first, and there must be a carriage return preceding each additional segment.

My input may have line breaks (or carriage returns) in the middle of a segment, which is not allowed. I also may encounter a new segment beginning on the same line as another one, which is also not allowed.

I intend to parse the input, first strip all line breaks, and find any matches of valid segment headers, and insert a carriage return before them. I have defined an array with all valid segment headers, and am attempting to use a foreach loop to do a simple search and replace to insert the \r before each match. I think it may be a good idea to match for each string plus '|', eg match on 'PV1|' to be more precise.

I'm not getting the expected output, so I humbly ask for some expertise. Thanks much!

Upvotes: 2

Views: 703

Answers (2)

perreal
perreal

Reputation: 98088

@validSegHdrs = (   "ABS", # .....
);

my $regex = join ("|", @validSegHdrs);
while (<>) {
  s/\R/ /g;
  s/($regex)/\r$1/g;
  print;
}

Upvotes: 1

amon
amon

Reputation: 57640

I used this script from the command line:

perl -e 'print "\n"; local $/; $in=<>; $in=~s/\R//g; my @blk = qw(ABS ACC ADD ADJ AFF AIG AIL AIP AIS AL1 APR ARQ ACC ADD ADJ AFF AIG AIL AIP AIS AL1 APR ARQ ARV AUT BHS BLC BLG BPO BPX BTS BTX CDM CER CM0 CM1 CM2 CNS CON CSP CSR CSS CTD CTI DB1 DG1 DMI DRG DSC DSP ECD ECR EDU EQP EQU ERR EVN FAC FHS FT1 FTS GOL GP1 GP2 GT1 IAM IIM ILT IN1 IN2 IN3 INV IPC IPR ISD ITM IVC IVT LAN LCC LCH LDP LOC LRL MFA MFE MFI MRG MSA MSH NCK NDS NK1 NPU NSC NST NTE OBR OBX ODS ODT OM1 OM2 OM3 OM4 OM5 OM6 OM7 ORC ORG OVR PCE PCR PD1 PDA PDC PEO PES PID PKG PMT PR1 PRA PRB PRC PRD PSG PSH PSL PSS PTH PV1 PV2 PYE QAK QID QPD QRD QRF QRI RCP RDF RDT REL RF1 RFI RGS RMI ROL RQ1 RQD RXA RXC RXD RXE RXG RXO RXR SAC SCD SCH SCP SDD SFT SID SLT SPM STF STZ TCC TCD TQ1 TQ2 TXA UAC UB1 UB2 URD URS VAR VND); $in=~s/$_/\n$_/ for @blk; print $in, "\n";'

And got this output:

MSH|^~\&|PCM|A|NSG|A|20120613081122|DoNotBundle|ORM^O01|1133316|P|2.2|||AL|NE
PID|1|1234567^PI^PE|345235^ST02A^MR^A~02340395^ST02^PI||HSM^AERHART||19510418000000|F||||||||||1215200001^A|111-22-3333
PV1|1|I|CCU^W207^A^A||||12342^ALI^ROGERS^M^MD^MD|||SUR|||||||16532^ALI^ROGERS^M^MD^MD|INP||B|||||||||||||||||||A|||||20120531145230
ORC|PA|11109489^PCM|11109489^PCM|94986|SC||1^Continuous^INDEF^20120613081900^1||20120613081958|RGYIDDER^YIDDER^ROBERT^GSYSTEM     ADM^SA||16532^ALI^ROGERS^MMD^MD|CCU||20120613081958|||CCU|RGYIDDER^YIDDER^ROBERT^G^SYSTEM     ADM^SA
OBR|1|11109489^PCM|11109489^PCM|DNR ON^Hard of Hearing^NSG||20120613081122||||||||||16532^ALI^ROGERS^M^MD^MD|||||||||||1^Continuous^INDEF^20120613081900^1

If the script were written indented, it would look like this:

local $/;
$in=<>;
$in=~s/\R//g;
my @blk = qw(
    ABS ACC ADD ADJ AFF AIG AIL AIP AIS AL1 APR ARQ ACC ADD ADJ AFF AIG AIL AIP
    AIS AL1 APR ARQ ARV AUT BHS BLC BLG BPO BPX BTS BTX CDM CER CM0 CM1 CM2 CNS
    CON CSP CSR CSS CTD CTI DB1 DG1 DMI DRG DSC DSP ECD ECR EDU EQP EQU ERR EVN
    FAC FHS FT1 FTS GOL GP1 GP2 GT1 IAM IIM ILT IN1 IN2 IN3 INV IPC IPR ISD ITM
    IVC IVT LAN LCC LCH LDP LOC LRL MFA MFE MFI MRG MSA MSH NCK NDS NK1 NPU NSC
    NST NTE OBR OBX ODS ODT OM1 OM2 OM3 OM4 OM5 OM6 OM7 ORC ORG OVR PCE PCR PD1
    PDA PDC PEO PES PID PKG PMT PR1 PRA PRB PRC PRD PSG PSH PSL PSS PTH PV1 PV2
    PYE QAK QID QPD QRD QRF QRI RCP RDF RDT REL RF1 RFI RGS RMI ROL RQ1 RQD RXA
    RXC RXD RXE RXG RXO RXR SAC SCD SCH SCP SDD SFT SID SLT SPM STF STZ TCC TCD
    TQ1 TQ2 TXA UAC UB1 UB2 URD URS VAR VND);
$in=~s/$_/\n$_/ for @blk;
print $in, "\n";

You would replace the \n with a \r I guess.

I don't know what the real difference between our scripts is, but it works for me??

Do note that using a hash could be more efficient (O(n)O(1) where n is the number of header sequences):

 my %hash = map {$_ => 1} @blk;
 # Test if $1 is a header sequence, if so, print newline
 $in =~ s/( [A-Z0-9]{3} )/ $hash{$1} ? "\n$1" : $1 /xeg;

Upvotes: 0

Related Questions