Reputation: 31
I have a file that contains several subgroups of codes and description. I need parse each section that starts with "VALUE" until I see a semi-colon ";".
The file is .sas. The VALUE statement tells me the type of reference data and all following lines are instances until a semi-colon denotes the end of that group. I have written something that works but as an old java developer it is very procedural and ugly. I am sure there is a much more efficient perl approach to this problem. By efficient, I mean that I am brute forcing the validation with many IF/ELSE statements.
Here is a snippet of a .sas file I am parsing (NOTE: this is an incomplete file but has the use case):
*********************************************************************
MARCH 20, 2018 2:05 PM
This is an example of a SAS program that creates a SAS
file from the 2017 NHIS Public Use HOUSEHLD.DAT ASCII file
This is stored in HOUSEHLD.SAS
*********************************************************************;
* USER NOTE: PLACE NEXT STATEMENT IN SUBSEQUENT PROGRAMS;
LIBNAME NHIS "C:\NHIS2017";
* USER NOTE: PLACE NEXT STATEMENT IN SUBSEQUENT PROGRAMS
IF YOU ALLOW PROGRAM TO PERMANENTLY STORE FORMATS;
LIBNAME LIBRARY "C:\NHIS2017";
FILENAME ASCIIDAT 'C:\NHIS2017\HOUSEHLD.dat';
* DEFINE VARIABLE VALUES FOR REPORTS;
* USE THE STATEMENT "PROC FORMAT LIBRARY=LIBRARY"
TO PERMANENTLY STORE THE FORMAT DEFINITIONS;
* USE THE STATEMENT "PROC FORMAT" IF YOU DO NOT WISH
TO PERMANENTLY STORE THE FORMATS;
PROC FORMAT LIBRARY=LIBRARY;
*PROC FORMAT;
VALUE $GROUPC
' '< - HIGH = "Range of Values"
;
VALUE GROUPN
LOW - HIGH = "Range of Values"
;
VALUE HHP001X
10 = "10 Household"
20 = "20 Person"
25 = "25 Income Imputation"
30 = "30 Sample Adult"
38 = "38 Functioning and Disability"
40 = "40 Sample Child"
60 = "60 Family"
63 = "63 Family Disability Questions"
65 = "65 Paradata"
70 = "70 Injury/Poisoning Episode"
75 = "75 Injury/Poisoning Verbatim"
;
VALUE HHP008X
01 = "01 House, apartment, flat, condo"
02 = "02 HU in nontransient hotel, motel"
03 = "03 HU-permanent in transient hotel, motel"
04 = "04 HU in rooming house"
05 = "05 Mobile home/trailer w/no permanent rooms added"
06 = "06 Mobile home/trailer w/1+ permanent rooms added"
07 = "07 HU not specified above"
08 = "08 Quarters not HU in room or board house"
09 = "09 Unit not permanent-transient hotel, motel"
10 = "10 Unoccupied site for mobile home/trailer/tent"
11 = "11 Student quarters in college dormitory"
12 = "12 Group quarter unit not specified above"
98 = "98 Not ascertained"
;
VALUE HHP009X
1 = "1 Refused"
2 = "2 No one home - repeated calls"
3 = "3 Temporarily absent"
4 = "4 Language problem"
5 = "5 Other"
;
VALUE HHP015X
1 = "1 Northeast"
2 = "2 Midwest"
3 = "3 South"
4 = "4 West"
;
DATA NHIS.HOUSEHLD;
* CREATE A SAS DATA SET;
INFILE ASCIIDAT PAD LRECL=47;
Here is my script
#!/usr/bin/perl
# This script looks through a file for the word "VALUE"
# If it finds the word, it will identify the value type and
# then process code/description rows until it finds a semi-colon.
# A semi-colon resets a new search for a value type to begin
use strict;
use warnings;
use diagnostics;
my $file = 'HOUSEHLD.sas';
my $cnt = 0;
my $i = 0;
my $size = 0;
my $valgrp = "";
open my $fh, '<', $file || die "Could not open $file: $!";
while (my $line = <$fh>) {
chomp $line;
$cnt = ($line =~ s/(VALUE )/$1/g);
$line =~ s/^\s+|\s+$//g; #strip leading and trailing spaces
#does the array contain only one instance of 'VALUE'
#check if we are in a refernce value group
if ($valgrp eq "t") {
my @refval = split("=", $line);
if ($line ne ";" ){
print "code: $refval[0]";
print " description: $refval[1]\n";
}
# when you see a semi-colon you are at the end of referecnce block
elsif ($refval[0] eq ";") {
$valgrp ="f";
}
}
if ($cnt == 1) {
my @row = split(" ", $line);
if ( $row[0] eq "VALUE" && scalar(@row) == 2 ) {
print "code type: $row[1]\n";
$valgrp = "t";
}
}
}
close ($fh);
This is expected (but not final output). I will be creating a .csv file or placing directly into MySQL Table created by the VALUE Type. The first two VALUE types are not valid but they are here while I am working through the file. Not sure if $GROUPC and GROUPN are always the first two and whether I code some type of ignore.
code type: $GROUPC
code: ' '< - HIGH description: "Range of Values"
code type: GROUPN
code: LOW - HIGH description: "Range of Values"
code type: HHP001X
code: 10 description: "10 Household"
code: 20 description: "20 Person"
code: 25 description: "25 Income Imputation"
code: 30 description: "30 Sample Adult"
code: 38 description: "38 Functioning and Disability"
code: 40 description: "40 Sample Child"
code: 60 description: "60 Family"
code: 63 description: "63 Family Disability Questions"
code: 65 description: "65 Paradata"
code: 70 description: "70 Injury/Poisoning Episode"
code: 75 description: "75 Injury/Poisoning Verbatim"
code type: HHP002X
code: . description: '.'
code: OTHER description: "Survey Year"
Upvotes: 1
Views: 55
Reputation: 66924
Here is an approach similar to yours, simplified and cleaned up a bit. It's quite fine for this job.
use warnings;
use strict;
use feature 'say';
use Data::Dump qw(dd);
my $file = shift || die "Usage: $0 file\n";
open my $fh, '<', $file or die "Can't open $file: $!";
my (%data, $group_val, $in_group);
while (<$fh>)
{
if (/^\s*VALUE\s*(.*)/) {
$group_val = $1;
$in_group = 1;
next;
}
elsif (/^\s*;\s*$/) {
$in_group = 0;
}
next if not $in_group;
push @{$data{$group_val}}, [ map { s/^\s+|\s+$//gr } split /\s*=\s*/ ];
#say "$group_val: @refval";
}
dd \%data;
The [ ]
in the last line creates an anonymous array and this reference is added to the arrayref for $group_val
key of the hash with all results, %data
. Inside [ ]
a pair of values is obtained from $_
(input line) split on =
(with possible spaces around), which are cleaned up of spaces in map
.
I print data using Data::Dump, for convenient viewing. The output is as expected: for key HHP001X
the value is an arrayref, with arrayref elements ([10, '10 Household']
,...), etc. (I don't see what OTHER
in expected output is about, and I see no Survey...
in the sample).
I store data in the hash so that one can work with VALUE
s as keys, but if their order from the file need be kept then we'd need to either also record the order (in an array) so that the hash can be sorted by it, or to use an array (of arrayrefs) instead of a hash to store data.
Upvotes: 3
Reputation: 52549
The range operator (..
) is useful here.
This example just prints out the lines in blocks starting with VALUE and ending with a semicolon to give you a starting point:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
while (<>) {
chomp;
my $match = /^\s*VALUE (\w+)/ .. /^\s*;$/;
if ($match ne "" && $match == 1) {
say "Code type: $1";
} elsif ($match !~ /^$|E0/) {
say $_; # to-do: print out in your 'code: XX description: YY' format
}
}
It takes advantage of the scalar range operator's return value to determine if the current line is a VALUE
one, an ending semicolon, or not:
The value returned is either the empty string for false, or a sequence number (beginning with 1) for true. The sequence number is reset for each range encountered. The final sequence number in a range has the string "E0" appended to it, which doesn't affect its numeric value, but gives you something to search for if you want to exclude the endpoint.
Upvotes: 2