Vikas

Reputation: 327

Multiplying values from a column according to specific regular expression using Perl

I have a big tab-delimited file (10 GB) with 8 columns.

Col1        Col2    Col3 Col4     Col5        Col6       Col7    Col8

101_#2        1       2    F0       263        248        2       1.5

102_#1        1       6    F1       766        741        1       1.0

103_#1        2       15   V1       526        501        1       0.0

103_#1        2       9    V2       103        178        1       1.3

104_#1        1       12   V3       137        112        1       1.0

105_#1        1       17   F2       766        741        1       1.0

I want to multiply the value in Col8 by the number that follows "#" in Col1 and write the product back after the "#", so that the output looks like this:

Col1        Col2    Col3 Col4     Col5        Col6       Col7    Col8

101_#3        1       2    F0       263        248        2       1.5

102_#1        1       6    F1       766        741        1       1.0

103_#0        2       15   V1       526        501        1       0.0

103_#1.3      2       9    V2       103        178        1       1.3

104_#1        1       12   V3       137        112        1       1.0

105_#1        1       17   F2       766        741        1       1.0

The first row is a header, and I want it to appear unchanged in the output.

My attempt so far:

use strict;
use warnings;

@ARGV or die "No input file specified";

open my $fh, '<', $ARGV[0] or die "Unable to open input file: $!";
print scalar(<$fh>);

while (<$fh>) {
    chomp;
}

Upvotes: 0

Views: 633

Answers (4)

TLP

Reputation: 67900

If your data is proper CSV data, I would suggest parsing it with a CSV module such as Text::CSV or Text::CSV_XS.

Replace the DATA and STDOUT file handles as required. The CSV options may need to be tweaked to fit your data; refer to the module documentation. This is a basic usage of Text::CSV_XS:

#!/usr/bin/perl
use strict;
use warnings;

use Text::CSV_XS;

my $csv = Text::CSV_XS->new({
        sep_char => "\t",
        binary  => 1,
        eol     => $/,
    });

my $hrs = <DATA>;
print $hrs;

while (my $row = $csv->getline(*DATA)) {
    $row->[0] =~ s/#\K(\d+)$/ $row->[7] * $1 /e;
    $csv->print(*STDOUT, $row );
}

__DATA__
Col1    Col2    Col3    Col4    Col5    Col6    Col7    Col8
101_#2  1   2   F0  263 248 2   1.5
102_#1  1   6   F1  766 741 1   1.0
103_#1  2   15  V1  526 501 1   0.0
103_#1  2   9   V2  103 178 1   1.3
104_#1  1   12  V3  137 112 1   3.0
105_#1  1   17  F2  766 741 1   23.0

Note that the data above may not contain proper tabs because of the StackOverflow conversion.

Upvotes: 0

Lumi

Reputation: 15264

Use unpack:

use strict;
use warnings;
no warnings 'uninitialized';

# fixed-width file, so use unpack
# offsets: 20 28 33 42 58 74 82

my $header = <>; # ignore

while ( <> ) {
#   print;
    my @cols = unpack 'a19 a8 a5 a9 a16 a16 a8 a*';
#   print "$_\n" for @cols; exit;
    s/\s+$// for @cols; # trim
#   print join(', ', @cols), "\n";
    my $num;
    if ( 0 <= (my $idx = rindex $cols[0], '#') ) {
        $num = substr $cols[0], $idx + 1;
    }
    else {
        warn "no number after # in col1\n";
    }
    printf "%f * %f = %f\n", $num, $cols[7], $num * $cols[7];
}

Upvotes: 0

Zaid

Reputation: 37136

In the absence of a concerted effort on the OP's part, an explanation should suffice:

  • Use a Perl one-liner to process this file line-by-line
  • The -i flag will enable in-place editing of the file. -i.bak creates a backup
  • Use $. in a conditional to skip the header line
  • Columns 1 and 8 can be accessed through the -a flag, which autosplits each line on whitespace to populate the @F array. The -F flag can be used to specify a different split delimiter. Testing whether @F is empty can also be used to skip empty lines
  • The s///e construct will be useful for updating the value to what you desire
  • autochomping with the -l flag is highly recommended

See perldoc perlrun, perldoc perlretut and perldoc perlop for more information.

Upvotes: 2

Mat

Reputation: 206669

Here's one way you could do it. The idea is to pass the header lines through unchanged, then simply split each remaining line into columns and extract the information you want.

use strict;
use warnings;

# Copy the first two lines (the header and the line after it) through unchanged
print scalar(<>);
print scalar(<>);

# Process each other line
while (<>) {
    # Skip empty lines
    print and next if /^\s*$/;
    # Split on whitespace
    my @cols = split(/\s+/);
    # Split the first column on '#', removing it from the column list
    my ($p1, $p2) = split(/#/, shift @cols);
    # Multiply and print (original whitespace is replaced with tabs)
    print $p1, "#", $cols[6]*$p2, "\t", join("\t", @cols), "\n";
}

Upvotes: 1
