user99187

Reputation: 13

Trying to modify awk code

awk  'BEGIN{OFS=","} FNR == 1
            {if (NR > 1) {print fn,fnr,nl}
                        fn=FILENAME; fnr = 1; nl = 0}
                        {fnr = FNR}
                        /ERROR/ && FILENAME ~ /\.gz$/ {nl++}
                        {
                            cmd="gunzip -cd " FILENAME
                            cmd; close(cmd)
                         }
            END                    {print fn,fnr,nl}
        ' /tmp/appscraps/* > /tmp/test.txt

The above scans all files in a given directory and prints the file name, the number of lines in each file, and the number of lines containing 'ERROR'.

I'm now trying to make the script run a command whenever a file it reads isn't a regular file, i.e., if the file is a gzip file, run a particular command.

Above is my own attempt to include the gunzip command. Unfortunately, it isn't working. Also, I cannot gunzip all the files in the directory beforehand, because not all of them are gzip files; some are regular files.

So I need the script to treat any .gz file it finds differently so that it can read it, count and print the number of lines in it, and the number of lines matching the supplied pattern (just as it would for a regular file).

Any help?

Upvotes: 1

Views: 221

Answers (3)

Ed Morton

Reputation: 203324

This part of your script makes no sense:

        {if (NR > 1) {print fn,fnr,nl}
                    fn=FILENAME; fnr = 1; nl = 0}
                    {fnr = FNR}
                    /ERROR/ && FILENAME ~ /\.gz$/ {nl++}

Let me restructure it a bit and comment it so it's clearer what it does:

{ # for every line of every input file, do the following:

    # If this is the 2nd or subsequent line, print the values of these variables:
    if (NR > 1) {
         print fn,fnr,nl
    } 

    fn = FILENAME    # set fn to FILENAME. Since this will occur for the first line of
                     # every file, this is the value fn will have when printed above,
                     # so why not just get rid of fn and print FILENAME?

    fnr = 1          # set fnr to 1. This is immediately over-written below by
                     # setting it to FNR so this is pointless.

    nl = 0

}
{ # for every line of every input file, also do the following
  # (note the unnecessary "}" then "{" above):

    fnr = FNR        # set fnr to FNR. Since this will occur for the first line of
                     # every file, this is the value fnr will have when printed above,
                     # so why not just get rid of fnr and print FNR-1?
} 

/ERROR/ && FILENAME ~ /\.gz$/ {

    nl++             # increment the value of nl. Since nl is always set to zero above,
                     # this will only ever set it to 1, so why not just set it to 1?
                     # I suspect the real intent is to NOT set it to zero above.

}

You also have the code above testing for a file name that ends in ".gz" but then you're running gunzip on every file in the very next block.

Beyond that, just call gunzip from shell as everyone else also suggested. awk is a tool for parsing text, it's not an environment from which to call other tools - that's what a shell is for.

For example, assuming your comment (prints the file name, number of lines in each file and number of lines found containing 'ERROR') accurately describes what you want your awk script to do, and assuming it makes sense to test for the word "ERROR" directly in a ".gz" file using awk:

for file in /tmp/appscraps/*.gz
do
    awk -v OFS=',' '/ERROR/{nl++} END{print FILENAME, NR+0, nl+0}' "$file"
    gunzip -cd "$file"
done > /tmp/test.txt

Much clearer and simpler, isn't it?

If it doesn't make sense to test for the word ERROR directly in a ".gz" file, then you can do this instead:

for file in /tmp/appscraps/*.gz
do
    zcat "$file" | awk -v file="$file" -v OFS=',' '/ERROR/{nl++} END{print file, NR+0, nl+0}'
    gunzip -cd "$file"
done > /tmp/test.txt

To handle gz and non-gz files as you've now described in your comment below:

for file in /tmp/appscraps/*
do
    case $file in
        *.gz ) cmd="zcat" ;;
        * )    cmd="cat"  ;;
    esac

    "$cmd" "$file" |
        awk -v file="$file" -v OFS=',' '/ERROR/{nl++} END{print file, NR+0, nl+0}'

done > /tmp/test.txt

I left out the gunzip since you don't need it as far as I can tell from your stated requirements. If I'm wrong, explain what you need it for.
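To try the gz/non-gz dispatch on throwaway data before pointing it at real logs, a minimal sketch along these lines may help. The temporary directory, the sample file names and the helper name count_errors are all made up for the demo, and gzip -cd is used in place of zcat on the assumption that the two behave the same for .gz files.

```shell
#!/bin/sh
# Hypothetical demo directory standing in for /tmp/appscraps.
dir=$(mktemp -d)
printf 'ok\nERROR one\nERROR two\n' > "$dir/plain.log"
printf 'ERROR three\nok\n' | gzip -c > "$dir/packed.log.gz"

# count_errors is an illustrative helper name, not part of the answer above.
# It decompresses .gz files, cats everything else, and lets awk do the counting.
count_errors() {
    case $1 in
        *.gz ) gzip -cd "$1" ;;
        * )    cat "$1" ;;
    esac |
        awk -v file="$1" -v OFS=',' '/ERROR/{nl++} END{print file, NR+0, nl+0}'
}

# One "file,lines,errors" record per file, collected into a variable.
report=$(for file in "$dir"/*; do count_errors "$file"; done)
printf '%s\n' "$report"

rm -rf "$dir"
```

With the sample data above, the record for packed.log.gz should end in ,2,1 and the one for plain.log in ,3,2.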

Upvotes: 1

Elisiano Petrini

Reputation: 602

I think it could be simpler than that.

With shell expansion you already have the file name (hence you can print it). So you can do a loop over all the files, and for each do the following:

  • print the file name
  • zgrep -c ERROR $file (this outputs the number of lines containing 'ERROR')
  • zcat -f $file|wc -l (this outputs the number of lines in the file)

zgrep works on both plain text files and gzipped ones; zcat does too when given -f (with GNU gzip, plain zcat rejects files that are not in gzip format).

Assuming you don't have any spaces in the paths/filenames:

for f in /tmp/appscraps/* 
do
   n_lines=$(zcat -f "$f" | wc -l)   # -f passes non-gzip files through unchanged (GNU gzip)
   n_errors=$(zgrep -c ERROR "$f")
   echo "$f $n_lines $n_errors"
done

This is untested but it should work.
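To check how these tools treat a plain file next to a gzipped one on your system, a small sketch like the following may help. It assumes GNU gzip, where -f together with -c copies non-gzip input through unchanged, and a zgrep that ships with it. The temporary directory and the file names plain.log and packed.log.gz are made up for the demo.

```shell
#!/bin/sh
# Throwaway test data: the same two lines, once plain and once gzipped.
dir=$(mktemp -d)
printf 'ok\nERROR one\n' > "$dir/plain.log"
gzip -c "$dir/plain.log" > "$dir/packed.log.gz"

# gzip -cdf decompresses gzip input and passes anything else through as-is.
# tr strips the padding some wc implementations add around the count.
plain_lines=$(gzip -cdf "$dir/plain.log" | wc -l | tr -d ' ')
packed_lines=$(gzip -cdf "$dir/packed.log.gz" | wc -l | tr -d ' ')

# zgrep -c counts matching lines in either kind of file.
plain_errors=$(zgrep -c ERROR "$dir/plain.log")
packed_errors=$(zgrep -c ERROR "$dir/packed.log.gz")

echo "plain:  $plain_lines lines, $plain_errors errors"
echo "packed: $packed_lines lines, $packed_errors errors"

rm -rf "$dir"
```

If both lines report the same counts, the loop above can be pointed at a directory that mixes the two kinds of file.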

Upvotes: 1

blackSmith

Reputation: 3154

You can execute the following command for each file:

gunzip -t FILENAME; echo $?

It will print the exit code: 0 for a valid gzip file, or 1 for a corrupt file or a file of another type. You can then test that exit status with an if statement to choose the required processing.
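As a rough sketch of that idea, the exit status of gunzip -t can drive the if directly, without echoing $? first. The temporary directory, the sample files and the variable names below are hypothetical, and the ERROR count simply mirrors what the question asks for.

```shell
#!/bin/sh
# Throwaway test data: one plain file and one gzipped file.
dir=$(mktemp -d)
printf 'ERROR a\nok\n' > "$dir/plain.log"
printf 'ok\nERROR b\nERROR c\n' | gzip -c > "$dir/packed.log.gz"

result=""
for f in "$dir"/*; do
    # gunzip -t exits 0 only when the file passes the gzip integrity test;
    # its complaint about non-gzip files is discarded.
    if gunzip -t "$f" 2>/dev/null; then
        kind=gzip
        n=$(gunzip -c "$f" | grep -c ERROR)
    else
        kind=plain
        n=$(grep -c ERROR "$f")
    fi
    # Record "name:kind:count" using the base name only.
    result="$result${f##*/}:$kind:$n "
done
echo "$result"

rm -rf "$dir"
```

Note that gunzip -t has to read the whole file, so each gzip file is decompressed twice with this approach. Testing the file name suffix is cheaper when the names can be trusted.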

Upvotes: 0
