Omri

Reputation: 1656

Manipulating text files in Linux

I have a bunch of CSV files in a folder (OS: Ubuntu). All of them have the same structure, with more than 2k columns (that's how I got them). The first column is an ID.

I'm not able to work with SQL (never mind why), so I guess I will need to work with bash commands such as awk, cut, sed, etc., of which I have basic knowledge.

I need to do the following: run over the files (as if they were merged into one file): for each even column, check whether its only distinct value is 0. If so, delete that column and the next column. In addition, I need to print the indexes of the removed columns into a new file.

Example

file_1:
2231, 0, 5, 0, 9, 0, 9, 3, 3
1322, 0, 5, 0, 1, 0, 9, 2, 5
1233, 5, 5, 0, 3, 0, 9, 4, 6
1543, 2, 5, 0, 4, 0, 9, 6, 1
2341, 0, 5, 0, 7, 0, 9, 0, 2

file_2:
1322, 0, 5, 0, 3, 0, 9, 1, 2
1432, 0, 5, 0, 0, 0, 9, 3, 7
1434, 0, 5, 0, 8, 0, 9, 1, 4
1132, 0, 5, 0, 4, 0, 9, 3, 5
1434, 0, 5, 0, 7, 0, 9, 1, 0

Expected result:

Removed-columns index file: 4, 5, 6, 7

    file_1 content:
    2231, 0, 5, 3, 3
    1322, 0, 5, 2, 5
    1233, 5, 5, 4, 6
    1543, 2, 5, 6, 1
    2341, 0, 5, 0, 2

    file_2 content:
    1322, 0, 5, 1, 2
    1432, 0, 5, 3, 7
    1434, 0, 5, 1, 4
    1132, 0, 5, 3, 5
    1434, 0, 5, 1, 0

Is it possible to do that with one of these bash commands? If so, how? Any other solution would be good as well, but I prefer the bash commands.

Upvotes: 1

Views: 402

Answers (3)

anubhava

Reputation: 785631

You can use awk to skip these columns with all zeroes:

awk 'BEGIN { FS=OFS=", " }
NR==1 {
   for (i=2; i<=NF; i+=2)
      a[i]
} FNR==NR {
   for (i=2; i<=NF; i+=2)
      if (i in a && $i>0)
         delete a[i];
   next
} {
   for (i=1; i<=NF; i++)
      if (!(i in a))
         printf "%s%s", $i, (i<NF)? OFS : RS
}' file1 file1

Output:

2231, 0, 5, 9, 9, 3, 3
1322, 0, 5, 1, 9, 2, 5
1233, 5, 5, 3, 9, 4, 6
1543, 2, 5, 4, 9, 6, 1
2341, 0, 5, 7, 9, 0, 2

It uses array a to keep the even column numbers that should be skipped in the output.

When each block runs:

NR==1   # runs for the first row only, to create array a with the even column numbers as indexes
FNR==NR # runs for the whole 1st pass of the file; it deletes entries from array a
        # whose value in the current row is greater than zero.
{...}   # runs in the 2nd pass; we iterate over each column and print it if its number is not in array a
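
If you also need the removed-columns index file from the question, a minimal addition (untested; removed_columns.txt is just a placeholder name) is an END block that dumps whatever is still in a after both passes:

END {
   for (i in a)                    # note: for-in iteration order is unspecified in awk
      print i > "removed_columns.txt"
}

Note this lists only the even all-zero columns that the script skips; it does not also list the odd columns that follow them, which the question asks to remove as well.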

UPDATE:

As per comments below

awk 'BEGIN{FS=OFS=","}
FNR==NR {
   for (i=1; i<=NF; i++)
      sums[i] += $i;
   ++r;
   next
} {
   for (i=1; i<=NF; i++)
      if (sums[i] > 0 && sums[i+1]>0 && sums[i] != 100*r)
         printf "%s%s", (i>1)?OFS:"", $i;
      print ""
}' file file

Upvotes: 3

NeronLeVelu

Reputation: 10039

for File in ListOfFile
do
    sed -i 's/\(\([^,]*,\)\{3\}\)\([^,]*,\)\{4\}\(.*\)/\1\4/' "$File"
done
  • provide ListOfFile however you want (an explicit file list, find, ls, ..., a wildcard); two example forms are sketched below
  • GNU sed is required for -i (in-place editing: direct modification of the file itself)
  • the regex keeps the first 3 comma-separated fields, drops the next 4 (columns 4-7), and keeps the rest
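
For example, assuming the CSVs all sit in one folder and carry a .csv extension (an assumption; adjust the pattern to your file names), the wildcard and find forms could look like:

for File in *.csv
do
    sed -i 's/\(\([^,]*,\)\{3\}\)\([^,]*,\)\{4\}\(.*\)/\1\4/' "$File"
done

# or hand the whole set to a single sed invocation via find
find . -maxdepth 1 -name '*.csv' -exec sed -i 's/\(\([^,]*,\)\{3\}\)\([^,]*,\)\{4\}\(.*\)/\1\4/' {} +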

Upvotes: 1

Peter Cordes

Reputation: 365257

You can't tell which columns to remove without looking at every line of every file.

I picked perl for throwing around strings and lists. You could do this in awk, but awk only has associative arrays. I think it would be annoying to filter out kill candidates in awk.

If you already know Python, use it. This is mostly throwing around lists, not strings, and python is good at that.

A 2-pass algorithm could go like this:

#!/usr/bin/perl -w
# pass 1 of the 2-pass algorithm: untested sketch
my $line_scalar = <>;  # read the first line
chomp $line_scalar;
my @line = split /\s*,\s*/, $line_scalar;
# initial candidates: the indices of the columns that are '0' in the first line
my @killcols = grep { $line[$_] eq '0' } 0 .. $#line;

while (<>) {
    chomp;
    @line = split /\s*,\s*/, $_;

    # filter out candidates that aren't zero in this line.  (string '0' is false in perl)
    @killcols = grep { ! $line[$_] } @killcols;
}

Since the number of columns to be removed is probably small, the list of candidates will quickly become short. Checking just the remaining candidates in each line should be significantly faster than looking over every field to find the ones that are zero. You still have to read and split each line, but this should be good. It's far better than reading the whole thing into memory and then looping over each line once for each column.

If your data set is too large to keep in memory, just re-read and re-split the input files once you know which columns to remove.
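
A minimal sketch of that second pass, assuming @killcols was computed by the loop above and that the file names were saved into @files before the first pass emptied @ARGV:

my %kill = map { $_ => 1 } @killcols;
foreach my $f (@files) {
    open my $in,  '<', $f       or die "error opening $f: $!";
    open my $out, '>', "$f.new" or die "error opening $f.new: $!";
    while (<$in>) {
        chomp;
        my @line = split /\s*,\s*/, $_;
        # keep only the field indices that aren't marked for removal
        print $out join(', ', @line[ grep { !$kill{$_} } 0 .. $#line ]), "\n";
    }
    close $in;
    close $out;
}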

If your data set isn't huge, you can store the lines in memory as you read them, and avoid reading/splitting them again. You'd have to keep track of which line goes with which file, though. In perl, probably a list of lists of parsed lines for each file would be good. To keep track of which list goes with which file, keep the LoLs in a hash.

In perl, normal lists flatten out. To make a list of lists, you need to make a list of list-refs.
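
For instance (@row and @rows are just illustrative names):

my @row = (1, 2, 3);
push @rows, @row;      # flattens: @rows grows by three scalars
push @rows, [ @row ];  # keeps structure: @rows grows by one reference to a copy of @row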

My perl is rusty, so I'm probably getting the reference syntax wrong.

# UNTESTED code, pretty sure there are minor bugs
# You'll prob. rewrite in python anyway

my @files = @ARGV;  # <> shifts names off @ARGV as it reads, so save the file list first
...  # get the number of columns ($maxcol) from the first line, outside the loop
while(<>) {
    chomp;
    @line = split /\s*,\s*/, $_;
    push @{ $lines{$ARGV} }, [ @line ];  # append the parsed fields to the list of lines for the current file

    # filter out candidates that aren't zero in this line.  (string '0' is false in perl)
    @killcols = grep { ! $line[$_] } @killcols;
}

#my @keepcols;
#foreach my $col (0..$maxcol) {
#    push @keepcols, $col if ! grep { $col == $_ } @killcols;
#}
# IDK if this is the most efficient way to invert a set
# In any case, actually storing a list of all the column numbers *to* print
# is probably slower than looping over the ranges between @killcol.

foreach my $f (@files) {
    open my $out, '>', "$f.new" or die "error opening $f.new: $!";
    foreach my $lref (@{ $lines{$f} }) {
        my @kept;
        my $printcol = 0;
        foreach my $kcol (@killcols) {
            push @kept, @{$lref}[$printcol .. $kcol-1];  # fields between killed columns
            $printcol = $kcol + 1;                       # skip the killed column
        }
        push @kept, @{$lref}[$printcol .. $maxcol];      # tail after the last killed column
        # join supplies the delimiters, so there is no trailing ", "
        # even when the last killed column == $maxcol
        print $out join(', ', @kept), "\n";
    }
    close $out or die "error writing $f.new: $!";
    # if there weren't any I/O errors, rename "$f.new" over "$f"
}

Upvotes: 0
