Reputation: 2725
How can I delete lines (rows) and columns in a text file that contain all zeros? For example, I have a file:
1 0 1 0 1
0 0 0 0 0
1 1 1 0 1
0 1 1 0 1
1 1 0 0 0
0 0 0 0 0
0 0 1 0 1
I want to delete the 2nd and 6th lines and also the 4th column. The output should look like:
1 0 1 1
1 1 1 1
0 1 1 1
1 1 0 0
0 0 1 1
I can do this for the lines with zeros using sed and egrep:
sed '/^0 0 0 0 0$/d' or egrep -v '^(0 0 0 0 0)$'
but that would be too inconvenient for files with thousands of columns. I have no idea how I can remove the column with all zeros, the 4th column here.
Upvotes: 6
Views: 4225
Reputation: 26591
Checking the rows can be done this way: awk '/[^0[:blank:]]/' file
It simply says: if a line contains any character different from 0 or a <blank>, print the line.
If you now want to check the columns as well, then I suggest an adaptation of Glenn Jackman's answer:
awk '
NR==1 {for (i=1; i<=NF; i++) if ($i == 0) zerocol[i]=1; next}
NR==FNR {for (idx in zerocol) if ($idx) delete zerocol[idx]; next}
/[^0[:blank:]]/ {for (i=1; i<=NF; i++) if (i in zerocol) $i=""; print}
' file file
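One caveat: emptying a field with $i="" makes awk rebuild the record with OFS on both sides of the now-empty field, so every removed column leaves a doubled space in the output; if that matters, piping through something like tr -s ' ' will squeeze the runs.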
Upvotes: 0
Reputation: 11
My compact and large-file-compatible alternative using grep and cut. Only drawback: it is slow for files with many columns, because the for loop rescans the file once per column.
# Remove constant lines using grep
# ($fIn, $fTmp and $fOut hold the input, temporary and output file names)
grep -v "^[0 ]*$\|^[1 ]*$" $fIn > $fTmp

# Remove constant columns using cut and wc
nc=`head -1 $fTmp | wc -w`
listcol=""
for (( i=1 ; i<=$nc ; i++ ))
do
    nitem=`cut -d" " -f$i $fTmp | sort -u | wc -l`
    if [ $nitem -gt 1 ]; then listcol=$listcol","$i; fi
done
listcol2=`echo $listcol | sed 's/^,//'`
cut -d" " -f$listcol2 $fTmp > $fOut
Upvotes: 0
Reputation: 514
This is a real tricky and challenging question, so in order to solve it we need to be tricky too :) In my version the script learns as it goes: every time we read a new line, we check whether any field can no longer be omitted, and if a new change is detected we start over.
The check-and-start-over should not happen very often: there will be a few rounds until the set of fields to omit becomes constant (possibly empty), and from then on we simply drop the zero values at those positions in each row.
#! /usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
open my $fh, '<', 'file.txt' or die $!;

## open a temp file for output
open my $temp, '>', 'temp.txt' or die $!;

## field indexes that may be omitted; extend this list
## if your data has more fields
my @fields_to_remove = (0, 1, 2, 3, 4);
my $change = $#fields_to_remove;

while (my $line = <$fh>) {
    if ($line =~ /1/) {
        my @new = split /\s+/, $line;
        my $i = 0;
        for (@new) {
            unless ($_ == 0) {
                ## a non-zero value: this column can no longer be omitted
                @fields_to_remove = grep { $_ != $i } @fields_to_remove;
            }
            $i++;
        }
        foreach my $field (@fields_to_remove) {
            $new[$field] = 'x';
        }
        my $new = join ' ', @new;
        $new =~ s/(\s+)?x//g;
        print $temp $new . "\n";
        ## if a new change is detected, start over;
        ## this repeats only a limited number of times,
        ## as the script keeps learning and eventually settles
        if ($#fields_to_remove != $change) {
            $change = $#fields_to_remove;
            seek $fh, 0, 0;
            close $temp;
            unlink 'temp.txt';
            open $temp, '>', 'temp.txt';
        }
    } else {
        ## nothing -- this skips all-zero lines
    }
}

### this just shows which fields have been removed
print Dumper \@fields_to_remove;
I have tested this with a 9-field, 25 MB data file and it worked perfectly; it wasn't super fast, but it didn't consume much memory either.
Upvotes: 1
Reputation: 107090
Off the top of my head...
The problem is the columns: how do you know whether a column is all zeros until you have read the entire file?
I'm thinking you need an array of arrays, one inner array per column, and you push each row's values onto them.
The trick is to skip the rows that contain all zeros as you read them in:
#! /usr/bin/env perl
#
use strict;
use warnings;
use autodie;
use feature qw(say);
use Data::Dumper;
my @array_of_columns;
for my $row ( <DATA> ) {
    chomp $row;
    next if $row =~ /^(0\s*)+$/;   # Skip zero rows
    my @columns = split /\s+/, $row;
    for my $index ( 0 .. $#columns ) {
        push @{ $array_of_columns[$index] }, $columns[$index];
    }
}

# Remove the columns that contain nothing but zeros
# (walk backwards so splicing doesn't shift the indexes still to visit)
for my $column ( 0 .. $#array_of_columns ) {
    my $index = $#array_of_columns - $column;
    my $values = join "", @{ $array_of_columns[$index] };
    if ( $values =~ /^0+$/ ) {
        splice @array_of_columns, $index, 1;
    }
}

say Dumper \@array_of_columns;
__DATA__
1 0 1 0 1
0 0 0 0 0
1 1 1 0 1
0 1 1 0 1
1 1 0 0 0
0 0 0 0 0
0 0 1 0 1
Of course, you could use Array::Transpose, which transposes your array and makes things much easier.
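For example, something like this sketch (assuming the CPAN module Array::Transpose is installed; the sample data is inlined for brevity):
#!/usr/bin/env perl
use strict;
use warnings;
use Array::Transpose;

# Read the matrix, dropping all-zero rows as we go.
my @matrix = grep { grep { $_ != 0 } @$_ }
             map  { [ split ' ' ] } <DATA>;

# Transpose, drop the all-zero "rows" (i.e. the original columns),
# then transpose back.
@matrix = transpose([ grep { grep { $_ != 0 } @$_ } transpose(\@matrix) ]);

print "@$_\n" for @matrix;

__DATA__
1 0 1 0 1
0 0 0 0 0
1 1 1 0 1
0 1 1 0 1
1 1 0 0 0
0 0 0 0 0
0 0 1 0 1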
Upvotes: 1
Reputation: 75618
This is my awk solution. It works with a variable number of rows and columns.
#!/usr/bin/gawk -f

BEGIN {
    FS = " "
}

{
    for (c = 1; c <= NF; ++c) {
        v = $c
        map[c, NR] = v
        ctotal[c] += v
        rtotal[NR] += v
    }
    fields[NR] = NF
}

END {
    for (r = 1; r <= NR; ++r) {
        if (rtotal[r]) {
            append = 0
            f = fields[r]
            for (c = 1; c <= f; ++c) {
                if (ctotal[c]) {
                    if (append) {
                        printf " %s", map[c, r]
                    } else {
                        printf "%s", map[c, r]
                        append = 1
                    }
                }
            }
            print ""
        }
    }
}
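Thanks to the gawk shebang, you can save this as, say, filter.awk (any name will do), make it executable with chmod +x filter.awk, and run it as ./filter.awk file.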
Upvotes: 1
Reputation: 26141
A little unorthodox solution, but fast as hell and with small memory consumption:
perl -nE's/\s+//g;$m|=$v=pack("b*",$_);push@v,$v if$v=~/[^\000]/}{$m=unpack("b*",$m);@m=split//,$m;@m=grep{$m[$_]eq"1"}0..$#m;say"@{[(split//,unpack(q(b*),$_))[@m]]}"for@v'
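For the curious, here is the same idea expanded into a readable script (a sketch; the variable names are mine, not part of the one-liner):
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';

my $mask = '';   # bitwise OR of all rows: bit i is 1 if column i is ever 1
my @rows;        # packed rows that contain at least one 1

while (<>) {
    s/\s+//g;                        # "1 0 1 0 1" -> "10101"
    my $packed = pack 'b*', $_;      # one bit per column
    $mask |= $packed;                # string bitwise OR accumulates the column mask
    push @rows, $packed if $packed =~ /[^\000]/;   # skip all-zero rows
}

# Indexes of the columns whose mask bit is set.
my $maskbits = unpack 'b*', $mask;
my @keep = grep { substr($maskbits, $_, 1) eq '1' } 0 .. length($maskbits) - 1;

# Print each surviving row, sliced down to the surviving columns.
say "@{[ (split //, unpack 'b*', $_)[@keep] ]}" for @rows;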
Upvotes: 1
Reputation: 118166
The following script also makes two passes. During the first pass, it saves the line numbers of the lines to be omitted from the output and the column indexes that should be included. In the second pass, it outputs those lines and columns. I think this should be close to the smallest possible memory footprint, which might matter if you are dealing with large files.
#!/usr/bin/env perl

use strict;
use warnings;

filter_zeros(\*DATA);

sub filter_zeros {
    my $fh = shift;
    my $pos = tell $fh;

    my %nonzero_cols;
    my %zero_rows;

    while (my $line = <$fh>) {
        last unless $line =~ /\S/;
        my @row = split ' ', $line;
        my @nonzero_idx = grep $row[$_], 0 .. $#row;
        unless (@nonzero_idx) {
            $zero_rows{$.} = undef;
            next;
        }
        $nonzero_cols{$_} = undef for @nonzero_idx;
    }

    my @idx = sort { $a <=> $b } keys %nonzero_cols;

    seek $fh, $pos, 0;
    local $. = 0;   # reset the line counter so %zero_rows keys match again

    while (my $line = <$fh>) {
        last unless $line =~ /\S/;
        next if exists $zero_rows{$.};
        print join(' ', (split ' ', $line)[@idx]), "\n";
    }
}
__DATA__
1 0 1 0 1
0 0 0 0 0
1 1 1 0 1
0 1 1 0 1
1 1 0 0 0
0 0 0 0 0
0 0 1 0 1
Output:
1 0 1 1
1 1 1 1
0 1 1 1
1 1 0 0
0 0 1 1
Upvotes: 1
Reputation: 247240
Rather than storing lines in memory, this version scans the file twice: once to find the "zero columns", and again to find the "zero rows" and produce the output:
awk '
NR==1 {for (i=1; i<=NF; i++) if ($i == 0) zerocol[i]=1; next}
NR==FNR {for (idx in zerocol) if ($idx) delete zerocol[idx]; next}
{p=0; for (i=1; i<=NF; i++) if ($i) {p++; break}}
p {for (i=1; i<=NF; i++) if (!(i in zerocol)) printf "%s%s", $i, OFS; print ""}
' file file
1 0 1 1
1 1 1 1
0 1 1 1
1 1 0 0
0 0 1 1
A Ruby program: Ruby has a nice Array method, transpose:
#!/usr/bin/ruby
def remove_zeros(m)
  m.select {|row| row.detect {|elem| elem != 0}}
end
matrix = File.readlines(ARGV[0]).map {|line| line.split.map {|elem| elem.to_i}}
# remove zero rows
matrix = remove_zeros(matrix)
# remove zero rows from the transposed matrix, then re-transpose the result
matrix = remove_zeros(matrix.transpose).transpose
matrix.each {|row| puts row.join(" ")}
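Run it as, e.g., ruby script.rb file (any script name will do), since the input file name is taken from ARGV[0].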
Upvotes: 4
Reputation: 290525
All together:
$ awk '{for (i=1; i<=NF; i++) {if ($i) {print; next}}}' file | awk '{l=NR; c=NF; for (i=1; i<=c; i++) {a[l,i]=$i; if ($i) e[i]++}} END{for (i=1; i<=l; i++) {for (j=1; j<=c; j++) {if (e[j]) printf "%d ",a[i,j] } printf "\n"}}'
This makes the row checking:
$ awk '{for (i=1; i<=NF; i++) {if ($i) {print; next}}}' file
1 0 1 1
1 0 1 0
1 0 0 1
It loops through all the fields of the line. If any of them is "true" (meaning not 0), it prints the line (print) and skips to the next line (next).
This makes the column checking:
$ awk '{l=NR; c=NF;
        for (i=1; i<=c; i++) {
            a[l,i]=$i;
            if ($i) e[i]++
        }}
  END{
        for (i=1; i<=l; i++) {
            for (j=1; j<=c; j++) {
                if (e[j]) printf "%d ", a[i,j]
            }
            printf "\n"
        }
  }'
It basically saves all the data in the a array, with l the number of lines and c the number of columns. e is an array recording whether a column has any value different from 0. Then it loops over the saved data and prints a field only when its e index is set, meaning that column has at least one non-zero value.
$ cat a
1 0 1 0 1
0 0 0 0 0
1 1 1 0 1
0 1 1 0 1
1 1 0 0 0
0 0 0 0 0
0 0 1 0 1
$ awk '{for (i=1; i<=NF; i++) {if ($i) {print; next}}}' a | awk '{l=NR; c=NF; for (i=1; i<=c; i++) {a[l,i]=$i; if ($i) e[i]++}} END{for (i=1; i<=l; i++) {for (j=1; j<=c; j++) {if (e[j]) printf "%d ",a[i,j] } printf "\n"}}'
1 0 1 1
1 1 1 1
0 1 1 1
1 1 0 0
0 0 1 1
With the previous input:
$ cat file
1 0 1 1
0 0 0 0
1 0 1 0
0 0 0 0
1 0 0 1
$ awk '{for (i=1; i<=NF; i++) {if ($i) {print; next}}}' file | awk '{l=NR; c=NF; for (i=1; i<=c; i++) {a[l,i]=$i; if ($i) e[i]++}} END{for (i=1; i<=l; i++) {for (j=1; j<=c; j++) {if (e[j]) printf "%d ",a[i,j] } printf "\n"}}'
1 1 1
1 1 0
1 0 1
Upvotes: 3
Reputation: 786359
Another awk variant:
awk '{show=0; for (i=1; i<=NF; i++) {if ($i!=0) show=1; col[i]+=$i;}} show==1{tr++; for (i=1; i<=NF; i++) vals[tr,i]=$i; tc=NF} END{for(i=1; i<=tr; i++) { for (j=1; j<=tc; j++) { if (col[j]>0) printf("%s%s", vals[i,j], OFS)} print ""; } }' file
Expanded Form:
awk '{
show=0;
for (i=1; i<=NF; i++) {
if ($i != 0)
show=1;
col[i]+=$i;
}
}
show==1 {
tr++;
for (i=1; i<=NF; i++)
vals[tr,i]=$i;
tc=NF
}
END {
for(i=1; i<=tr; i++) {
for (j=1; j<=tc; j++) {
if (col[j]>0)
printf("%s%s", vals[i,j], OFS)
}
print ""
}
}' file
Upvotes: 3
Reputation: 242443
Perl solution. It keeps all the non-zero lines in memory, to be printed at the end, because it cannot tell which columns will be non-zero before it has processed the whole file. If you get an Out of memory error, you can instead store only the numbers of the lines you want to output, and process the file again while printing those lines.
#!/usr/bin/perl
use warnings;
use strict;

my @nonzero;   # Which columns were not zero.
my @output;    # The whole table for output.

while (<>) {
    next unless /1/;
    my @col = split;
    $col[$_] and $nonzero[$_] ||= 1 for 0 .. $#col;
    push @output, \@col;
}

my @columns = grep $nonzero[$_], 0 .. $#nonzero;   # Which columns to output.
for my $line (@output) {
    print "@{$line}[@columns]\n";
}
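The low-memory variant mentioned above (remember only the line numbers, then re-read the file) could look roughly like this sketch; it assumes the input is a seekable file named on the command line, not a pipe:
#!/usr/bin/perl
use warnings;
use strict;

my $file = shift or die "usage: $0 FILE\n";
open my $fh, '<', $file or die "$file: $!";

my @nonzero;     # Which columns were not zero.
my %wanted;      # Numbers of the lines to output.

while (<$fh>) {
    next unless /1/;
    my @col = split;
    $col[$_] and $nonzero[$_] ||= 1 for 0 .. $#col;
    $wanted{$.} = 1;
}
my @columns = grep $nonzero[$_], 0 .. $#nonzero;

# Second pass: print only the remembered lines, sliced to the kept columns.
seek $fh, 0, 0;
$. = 0;          # reset the line counter after the seek
while (my $line = <$fh>) {
    next unless $wanted{$.};
    print "@{[ (split ' ', $line)[@columns] ]}\n";
}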
Upvotes: 4
Reputation: 13792
Try this:
perl -n -e '$_ !~ /0 0 0 0/ and print' data.txt
Or simply:
perl -n -e '/1/ and print' data.txt
where data.txt contains your data.
On Windows, use double quotes:
perl -n -e "/1/ and print" data.txt
Upvotes: 3