gaussblurinc


Perl: How to find the positions of captures

I have a space-separated file like this:

 First        Second        Third       Forth
 It               is        possible    to   
 do             this                    task
 with          regex        but         i
 don't          know        how         to 

My task is to capture all the words of each line and construct a hash from them.

But here is my problem: a field may be empty in any column (e.g. the 3rd field of the 3rd line).

The words in each line are aligned with either the beginning or the end of their column name. (The column names are the words in the first line, e.g. First Second Third Forth.)

In my example, words are left-aligned (to the beginning of the column name) in the First, Third and Forth columns, and right-aligned (to the end of the column name) in the Second column.

Using the hash built from each line, I have to create output formatted like this:

$hash{First} has Second-property $hash{Second}. It also has $hash{Third} and $hash{Forth}.

use File::Basename;
use locale;
open my $file, "<", $ARGV[0];
open my $file2,">>",fileparse($ARGV[0])."2.txt";
my @alls = <$file>;

sub Main{
my $first = shift @alls;
my $poses = First_And_Last($first);
my $curr_poses;
my $curr_hash;
#do{OutputLine($_->[0],$_->[1],$first)}for (@$poses);
my $result_array=[];
my @keys = qw(# Variable Type Len Format Informat Label);
for $word(@alls){
    $curr_poses=First_And_Last($word);
    undef ($curr_hash);
    $curr_hash = Take_Words($poses, $word, $curr_poses);
    push @{$result_array},$curr_hash; #AoH  
    }

#end of main
}

sub First_And_Last{
    #First_And_Last($str)
    my $str = shift;    
    my $begin;
    my $end;
    my $ref=[];
    while ($str=~m/(([\S\.]\s?)+\b|#)/g){       
        $begin = pos($str) - length($1);
        $end = pos($str);       
        push @{$ref},[$begin,$end];
        }               
    return $ref;
    }

sub Take_Words{
    #Take_Words($poses, $line,$current) 
    my $outref = {};
    my $ref = shift; #take the ref of offsets of words
    my $line = shift;# and the next line in file
    my $current = shift; # and this is the poses of current line
    my @keys = qw(# Variable Type Len Format Informat Label);
    do{$outref->{$_}=undef;}for(@keys);
    my $ethalon; #for $ref
    my $relativity; #for $current
    my $key; #for key in $outref
    my @ethalon = @{$ref};

    $ethalon = shift @ethalon;
    $relativity = shift @{$current};
    $key = shift @keys;

    while (defined($key) && defined($relativity)){
        if ($ethalon->[0] == $relativity->[0] || $ethalon->[1] == $relativity->[1]){    
                $outref->{$key} = substr($line, $relativity->[0],$relativity->[1] - $relativity->[0]);          

                $relativity = shift @{$current};
            }
            $ethalon = shift @ethalon;
            $key = shift @keys;         
        }


    return $outref;
    }


Answers (1)

amon


Here is my algorithm, but it is somewhat C-ish:

  1. Determine the starting position of each column heading and store it.

  2. For each column: go to the heading's starting position.

  3. Step left until you have passed two consecutive spaces.

  4. Go right two characters, then remember the position.

  5. Go right until you have passed two consecutive spaces.

  6. Go left two characters, then remember the position.

  7. Extract everything between the found boundaries.

  8. Remove leading and trailing whitespace.

  9. Store the result in your hash.

  10. Repeat from step 2.

Now we'll have to see about that implementation:

Step 1:

my @starting;
{
  my @char = split m{}, <$file>; # split the first line into char array
  my $spacecount = 0;
  my $state = 1; # 1 : find start -- 0 : find end
  for (my $i = 0; $i < @char; $i++) {
    if ($state) { # find next non-space
      if ($char[$i] =~ /\s/) {
        next;
      } else {
        $state = not $state; # flip
        $spacecount = 0;
        push @starting, $i;
        next;
      }
    } else {
      if ($char[$i] =~ /\s/) {
        $spacecount++;
        if ($spacecount >= 2) {
          $state = not $state; # flip
          next;
        }
      } else {
        $spacecount = 0; # reset consecutive space counter
        next;
      }
    }
  }
}
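
The remaining steps (2 through 9) could be sketched roughly as follows. This is a variation, not the character-stepping loop described above: it keeps the rule that two consecutive spaces end a field, but takes the field offsets from the regex match variables `@-` and `@+`, and assigns each field to the heading whose span it overlaps. The overlap rule, the helper names `fields_with_offsets` and `parse_table`, and the sample rows are assumptions made for illustration, and the block finds the heading spans itself rather than reusing `@starting`.

```perl
use strict;
use warnings;

# Return [text, start, end] for every field in a line. A field is a run of
# non-space characters that may contain single spaces; two or more
# consecutive spaces end it.
sub fields_with_offsets {
    my ($line) = @_;
    my @fields;
    while ($line =~ /\S+(?: \S+)*/g) {
        # $-[0] and $+[0] are the start and end offsets of the last match
        push @fields, [ substr($line, $-[0], $+[0] - $-[0]), $-[0], $+[0] ];
    }
    return @fields;
}

# Build one hash per data line, keyed by the headings of the first line.
# A column with no overlapping field keeps the value undef.
sub parse_table {
    my ($header, @rows) = @_;
    my @columns = fields_with_offsets($header);
    my @records;
    for my $row (@rows) {
        my %record = map { $_->[0] => undef } @columns;
        for my $field (fields_with_offsets($row)) {
            my ($text, $start, $end) = @$field;
            for my $col (@columns) {
                my ($name, $col_start, $col_end) = @$col;
                # assumption: a field belongs to the first heading whose
                # half-open span [start, end) it overlaps
                if ($start < $col_end && $col_start < $end) {
                    $record{$name} = $text;
                    last;
                }
            }
        }
        push @records, \%record;
    }
    return @records;
}

# Sample rows modelled on the question; the second row has an empty Third field.
my @lines = (
    ' First        Second        Third       Forth',
    ' It               is        possible    to   ',
    ' do             this                    task',
);
my @records = parse_table(@lines);
for my $r (@records) {
    print join(' | ', map { defined $r->{$_} ? $r->{$_} : '(empty)' }
                      qw(First Second Third Forth)), "\n";
}
```

Matching by overlap handles both left-aligned and right-aligned fields, as long as no field is wide enough to reach under a neighbouring heading; if that can happen in the real data, the assignment rule would need to be stricter.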
