zara

Reputation: 1108

How to delete the first subset of each set of columns in a data file?

I have a data file with more than 40,000 columns. In the header, each column name begins with c1, c2, ..., cn, and each set of c has one or several subsets; for example, c1 has 2 subsets. I need to delete the first column (subset) of each set of c. For example, if the input looks like:

input:

    c1.20022  c1.31012  c2.44444  c2.87634  c2.22233 c3.00444  c3.44444 
     1    1         0         1         0         0         0         1     
     2    0         1         0         0         1         0         1     
     3    0         1         0         0         1         1         0     
     4    1         0         1         0         0         1         0     
     5    1         0         1         0         0         1         0     
     6    1         0         1         0         0         1         0     

I need the output to be like:

    c1.31012  c2.87634  c2.22233  c3.44444 
     1    0         0         0         1     
     2    1         0         1         1     
     3    1         0         1         0     
     4    0         0         0         0     
     5    0         0         0         0     
     6    0         0         0         0     
     7    1         0         0         0     

Any suggestions, please?

Update: If there is no space between the digits in a row (which is the real situation in my data set), what should I do? I mean that my real data looks like this. Input:

c1.20022  c1.31012  c2.44444  c2.87634  c2.22233 c3.00444  c3.44444 
         1    1010001     
         2    0100101     
         3    0100110     
         4    1010010     
         5    1010010     
         6    1010010     

and output:

c1.31012  c2.87634  c2.22233  c3.44444 
         1    0001     
         2    1011     
         3    1010     
         4    0000     
         5    0000     
         6    0000     
         7    1000     

Upvotes: 0

Views: 53

Answers (1)

choroba

Reputation: 241968

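Perl solution: It first reads the header line and uses a regex to extract the prefix of each column name (the part before the dot). A column is kept when its prefix equals that of the preceding column, that is, when it is not the first column of its set. The script records the indices of those columns and then uses them to print only the wanted columns from the header and from the remaining lines.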

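#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

# Read the header line and split it into column names.
my @header = split ' ', <>;
my $last = q();
# Index 0 is the row-number column of the data lines; keep it, as in the requested output.
my @keep = (0);
for my $i (0 .. $#header) {
    # The set prefix is everything before the dot, e.g. "c1" in "c1.20022".
    my ($prefix) = $header[$i] =~ /(.*)\./;
    # Same prefix as the previous column: not the first of its set, so keep it.
    if ($prefix eq $last) {
        # + 1 because the data lines start with the row number.
        push @keep, $i + 1;
    }
    $last = $prefix;
}
# Shift the header by one so its indices match those of the data lines.
unshift @header, q();
say join "\t", @header[@keep];

while (<>) {
    my @columns = split;
    say join "\t", @columns[@keep];
}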
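
To see which columns the prefix comparison selects, here is a minimal sketch that applies the same test to the sample header from the question. The subroutine name `kept_indices` is hypothetical and introduced only for illustration; it returns 0-based positions within the header and assumes every column name contains a dot.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

# Hypothetical helper (not part of the script above): returns the 0-based
# positions of the header names whose prefix equals that of the previous
# name, i.e. every column except the first one of each set.
# Assumption: every name contains a dot.
sub kept_indices {
    my @names = @_;
    my $last = q();
    my @keep;
    for my $i (0 .. $#names) {
        # Greedy match: everything before the last dot is the set prefix.
        my ($prefix) = $names[$i] =~ /(.*)\./;
        push @keep, $i if $prefix eq $last;
        $last = $prefix;
    }
    return @keep;
}

my @header = qw( c1.20022 c1.31012 c2.44444 c2.87634 c2.22233 c3.00444 c3.44444 );
my @keep = kept_indices(@header);
say "@keep";            # 1 3 4 6
say "@header[@keep]";   # c1.31012 c2.87634 c2.22233 c3.44444
```

The script above stores these positions shifted by one, because each data line starts with a row number.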

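Update: For the format without spaces between the digits, split the digit string into single characters and select the wanted ones by index: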

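#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

# Read the header line and split it into column names.
my @header = split ' ', <>;
my $last = q();
my @keep;
for my $i (0 .. $#header) {
    # The set prefix is everything before the dot, e.g. "c1" in "c1.20022".
    my ($prefix) = $header[$i] =~ /(.*)\./;
    # Same prefix as the previous column: not the first of its set, so keep it.
    if ($prefix eq $last) {
        push @keep, $i;
    }
    $last = $prefix;
}
say join "\t", @header[@keep];

while (<>) {
    # Each data line holds the line number and one string of digits.
    my ($line_number, $all_digits) = split;
    # One character per column, so the header indices apply directly.
    my @digits = split //, $all_digits;
    say join "\t", $line_number, join q(), @digits[@keep];
}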
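
As a quick check of the update, the sketch below applies the same slicing to some of the concatenated sample rows from the question, held in memory instead of being read from a file. The positions 1, 3, 4, 6 are hard-coded here and correspond to the kept columns of the sample header; the subroutine name `filter_digits` is hypothetical.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

# 0-based positions of the kept columns for the sample header
# (c1.31012, c2.87634, c2.22233, c3.44444).
my @keep = (1, 3, 4, 6);

# Hypothetical helper: keep only the characters at the given positions.
sub filter_digits {
    my ($all_digits, @positions) = @_;
    my @digits = split //, $all_digits;
    return join q(), @digits[@positions];
}

# Sample rows from the question: line number, then the digit string.
for my $row ('1 1010001', '2 0100101', '3 0100110', '4 1010010') {
    my ($line_number, $all_digits) = split ' ', $row;
    say join "\t", $line_number, filter_digits($all_digits, @keep);
}
```

For these rows the filtered strings are 0001, 1011, 1010 and 0000, which agrees with the output requested in the question.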

Upvotes: 2
