Bill Wimsatt
Bill Wimsatt

Reputation: 31

Parse subsections of a file to independent groups

I have a file that contains several subgroups of codes and description. I need parse each section that starts with "VALUE" until I see a semi-colon ";".

The file is .sas. The VALUE statement tells me the type of reference data and all following lines are instances until a semi-colon denotes the end of that group. I have written something that works but as an old java developer it is very procedural and ugly. I am sure there is a much more efficient perl approach to this problem. By efficient, I mean that I am brute forcing the validation with many IF/ELSE statements.

Here is a snippet of a .sas file I am parsing (NOTE: this is an incomplete file but has the use case):

*********************************************************************
 MARCH 20, 2018  2:05 PM

 This is an example of a SAS program that creates a SAS
 file from the 2017 NHIS Public Use HOUSEHLD.DAT ASCII file

 This is stored in HOUSEHLD.SAS
*********************************************************************;

* USER NOTE: PLACE NEXT STATEMENT IN SUBSEQUENT PROGRAMS;
LIBNAME  NHIS     "C:\NHIS2017";

* USER NOTE: PLACE NEXT STATEMENT IN SUBSEQUENT PROGRAMS
             IF YOU ALLOW PROGRAM TO PERMANENTLY STORE FORMATS;
LIBNAME  LIBRARY  "C:\NHIS2017";

FILENAME ASCIIDAT 'C:\NHIS2017\HOUSEHLD.dat';

* DEFINE VARIABLE VALUES FOR REPORTS;

*  USE THE STATEMENT "PROC FORMAT LIBRARY=LIBRARY"
     TO PERMANENTLY STORE THE FORMAT DEFINITIONS;

*  USE THE STATEMENT "PROC FORMAT" IF YOU DO NOT WISH
      TO PERMANENTLY STORE THE FORMATS;

PROC FORMAT LIBRARY=LIBRARY;
*PROC FORMAT;

   VALUE $GROUPC
      ' '< - HIGH   = "Range of Values"
   ;
   VALUE GROUPN
      LOW - HIGH   = "Range of Values"
   ;
   VALUE HHP001X
      10                 = "10 Household"
      20                 = "20 Person"
      25                 = "25 Income Imputation"
      30                 = "30 Sample Adult"
      38                 = "38 Functioning and Disability"
      40                 = "40 Sample Child"
      60                 = "60 Family"
      63                 = "63 Family Disability Questions"
      65                 = "65 Paradata"
      70                 = "70 Injury/Poisoning Episode"
      75                 = "75 Injury/Poisoning Verbatim"
   ;

   VALUE HHP008X
      01                 = "01 House, apartment, flat, condo"
      02                 = "02 HU in nontransient hotel, motel"
      03                 = "03 HU-permanent in transient hotel, motel"
      04                 = "04 HU in rooming house"
      05                 = "05 Mobile home/trailer w/no permanent rooms added"
      06                 = "06 Mobile home/trailer w/1+ permanent rooms added"
      07                 = "07 HU not specified above"
      08                 = "08 Quarters not HU in room or board house"
      09                 = "09 Unit not permanent-transient hotel, motel"
      10                 = "10 Unoccupied site for mobile home/trailer/tent"
      11                 = "11 Student quarters in college dormitory"
      12                 = "12 Group quarter unit not specified above"
      98                 = "98 Not ascertained"
   ;
   VALUE HHP009X
      1                  = "1 Refused"
      2                  = "2 No one home - repeated calls"
      3                  = "3 Temporarily absent"
      4                  = "4 Language problem"
      5                  = "5 Other"
   ;
   VALUE HHP015X
      1                  = "1 Northeast"
      2                  = "2 Midwest"
      3                  = "3 South"
      4                  = "4 West"
   ;

DATA NHIS.HOUSEHLD;
   * CREATE A SAS DATA SET;
   INFILE ASCIIDAT PAD LRECL=47;

Here is my script

#!/usr/bin/perl

# This script looks through a file for the word "VALUE"
# If it finds the word, it will identify the value type and
# then process code/description rows until it finds a semi-colon. 
# A semi-colon resets a new search for a value type to begin

use strict;
use warnings;
use diagnostics;

my $file = 'HOUSEHLD.sas';
my $cnt = 0; 
my $i = 0;
my $size = 0;
my $valgrp = "";

open my $fh, '<', $file || die "Could not open $file: $!";

while (my $line = <$fh>) { 
    chomp $line; 

    $cnt = ($line =~ s/(VALUE )/$1/g);

    $line =~ s/^\s+|\s+$//g; #strip leading and trailing spaces


    #does the array contain only one instance of 'VALUE'
    #check if we are in a refernce value group
    if ($valgrp eq "t") {
        my @refval = split("=", $line); 
        if ($line ne ";" ){
            print "code: $refval[0]";
            print " description: $refval[1]\n";
        }
        # when you see a semi-colon you are at the end of referecnce block
        elsif ($refval[0] eq ";") { 
            $valgrp ="f";
        }
    }

    if ($cnt == 1) {

        my @row = split(" ", $line);    

        if ( $row[0] eq "VALUE" && scalar(@row) == 2 ) {
            print "code type: $row[1]\n";
            $valgrp = "t";
        }

    }

}

close ($fh);

This is expected (but not final output). I will be creating a .csv file or placing directly into MySQL Table created by the VALUE Type. The first two VALUE types are not valid but they are here while I am working through the file. Not sure if $GROUPC and GROUPN are always the first two and whether I code some type of ignore.

code type: $GROUPC
code: ' '< - HIGH    description:  "Range of Values"
code type: GROUPN
code: LOW - HIGH    description:  "Range of Values"
code type: HHP001X
code: 10                  description:  "10 Household"
code: 20                  description:  "20 Person"
code: 25                  description:  "25 Income Imputation"
code: 30                  description:  "30 Sample Adult"
code: 38                  description:  "38 Functioning and Disability"
code: 40                  description:  "40 Sample Child"
code: 60                  description:  "60 Family"
code: 63                  description:  "63 Family Disability Questions"
code: 65                  description:  "65 Paradata"
code: 70                  description:  "70 Injury/Poisoning Episode"
code: 75                  description:  "75 Injury/Poisoning Verbatim"
code type: HHP002X
code: .                    description:  '.'
code: OTHER               description:  "Survey Year"

Upvotes: 1

Views: 55

Answers (2)

zdim
zdim

Reputation: 66924

Here is an approach similar to yours, simplified and cleaned up a bit. It's quite fine for this job.

use warnings;
use strict;
use feature 'say';

use Data::Dump qw(dd);

my $file = shift || die "Usage: $0 file\n";

open my $fh, '<', $file or die "Can't open $file: $!";

my (%data, $group_val, $in_group);

while (<$fh>) 
{
    if (/^\s*VALUE\s*(.*)/) {
        $group_val = $1;
        $in_group = 1;
        next;
    }
    elsif (/^\s*;\s*$/) {
        $in_group = 0;
    }    
    next if not $in_group;

    push @{$data{$group_val}}, [ map { s/^\s+|\s+$//gr } split /\s*=\s*/ ];

    #say "$group_val: @refval";
}

dd \%data;

The [ ] in the last line creates an anonymous array and this reference is added to the arrayref for $group_val key of the hash with all results, %data. Inside [ ] a pair of values is obtained from $_ (input line) split on = (with possible spaces around), which are cleaned up of spaces in map.

I print data using Data::Dump, for convenient viewing. The output is as expected: for key HHP001X the value is an arrayref, with arrayref elements ([10, '10 Household'],...), etc. (I don't see what OTHER in expected output is about, and I see no Survey... in the sample).

I store data in the hash so that one can work with VALUEs as keys, but if their order from the file need be kept then we'd need to either also record the order (in an array) so that the hash can be sorted by it, or to use an array (of arrayrefs) instead of a hash to store data.

Upvotes: 3

Shawn
Shawn

Reputation: 52549

The range operator (..) is useful here.

This example just prints out the lines in blocks starting with VALUE and ending with a semicolon to give you a starting point:

#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;

while (<>) {
  chomp;
  my $match = /^\s*VALUE (\w+)/ .. /^\s*;$/;
  if ($match ne "" && $match == 1) {
    say "Code type: $1";
  } elsif ($match !~ /^$|E0/) {
    say $_; # to-do: print out in your 'code: XX description: YY' format
  }
}

It takes advantage of the scalar range operator's return value to determine if the current line is a VALUE one, an ending semicolon, or not:

The value returned is either the empty string for false, or a sequence number (beginning with 1) for true. The sequence number is reset for each range encountered. The final sequence number in a range has the string "E0" appended to it, which doesn't affect its numeric value, but gives you something to search for if you want to exclude the endpoint.

Upvotes: 2

Related Questions