jennifer.cl

Reputation: 225

Perform sequence of edits on a large text file

I am hoping to perform a series of edits to a large text file composed almost entirely of single letters separated by spaces. The file is about 300 rows by about 400,000 columns, and about 250 MB.

My goal is to transform this table using a series of steps, for eventual processing with another language (R, probably). I don't have much experience working with big data files, but Perl has been suggested to me as the best way to go about this. Please let me know if there is a better way :).

So, I am hoping to write a Perl script that does the following:

  1. Open the file and either edit it in place or write the result of the following steps to a new file:
  2. remove columns 2-6
  3. merge/concatenate pairs of columns, starting with column 2 (so, merge column 2-3,4-5, etc)
  4. replace each character pair according to a sequential conditional algorithm running across each row:

    [example PSEUDOCODE: if character 1 of cell = character 2 of cell=a,  cell=1
    else if character 1 of cell = character 2 of cell=b, cell=2
    etc.] such that except for the first column, the table is a numerical matrix
    
  5. remove every nth column, or keep every nth column and remove all others

I am just starting to learn Perl, so I was wondering whether these operations are possible in Perl, whether Perl would be the best way to do them, and whether there are any suggestions for syntax for these operations in the context of reading from and writing to a file.

Upvotes: 2

Views: 172

Answers (2)

Len Jaffe

Reputation: 3484

I'll start:

use strict;
use warnings;

while (<>) {
  chomp;
  my @cols = split ' ';          # split on runs of whitespace
  next unless @cols;             # skip blank lines
  splice(@cols, 1, 5);           # remove columns 2-6 (five columns)
  my @transformed = ($cols[0]);  # reset for every row; keep column 1
  for (my $i = 1; $i < @cols; $i += 2) {
    # merge column pairs; a trailing unpaired column is kept as-is
    push @transformed, $cols[$i] . ($cols[$i+1] // '');
  }

  # other transforms as required

  print join(' ', @transformed), "\n";
}

That should get you on your way.
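For the "other transforms" (steps 4 and 5 in the question), here is a minimal sketch rather than a definitive implementation. The question only spells out "aa" -> 1 and "bb" -> 2, so the `%code_for` table, the "NA" fallback for unlisted pairs, the helper names `recode_pair` and `keep_every_nth`, and the choice of n = 2 are all assumptions made for illustration. It runs on a small in-memory row that is assumed to have already been through the column removal and pair merging.

```perl
use strict;
use warnings;

# Hypothetical lookup table: only "aa" -> 1 and "bb" -> 2 come from the question.
my %code_for = (aa => 1, bb => 2);

# Step 4: recode one merged cell; unlisted pairs become "NA" (an assumption).
sub recode_pair {
    my ($pair) = @_;
    return exists $code_for{$pair} ? $code_for{$pair} : 'NA';
}

# Step 5: keep the first (ID) column plus every $n-th data column (n, 2n, ...).
sub keep_every_nth {
    my ($n, $id, @data) = @_;
    return ($id, map { $data[$_] } grep { ($_ + 1) % $n == 0 } 0 .. $#data);
}

# Demo row, as it might look after columns were removed and pairs merged.
my @row = qw(sample1 aa bb ab bb aa aa);
my ($id, @pairs) = @row;
my @recoded = map { recode_pair($_) } @pairs;
print join(' ', keep_every_nth(2, $id, @recoded)), "\n";   # prints "sample1 2 2 1"
```

Inside the `while (<>)` loop you would call these on the merged cells of each row instead of the demo row. Because every row is handled and printed on its own, memory use should stay at roughly one line at a time, even with 400,000 columns.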

Upvotes: 1

Ed Morton

Reputation: 204477

You need to post some sample input and expected output, or we're just guessing what you want, but maybe this will be a start:

awk '{
   printf "%s ", $1
   for (i=7;i<=NF;i+=2) {
      printf "%s%s ", $i, $(i+1)
   }
   print ""
}' file

Upvotes: 0
