justaguy

Reputation: 3022

bash to identify and verify file headers

Using the tab-delimited file below, I am trying to validate the header (line 1) by counting its fields and storing that count in a variable $header for use in a couple of if statements. If $header equals 10, the file has the expected number of fields. If $header is less than 10, the script should print "file is missing header for:" with the missing header fields listed underneath. The bash seems close, and the awk works perfectly when I run it by itself, but I cannot seem to use it in the if. Thank you :).

file.txt

Index   Chr Start   End Ref Alt Freq    Qual    Score   Input
1    1    1    100    C    -    1    GOOD    10    .
2    2    20    200    A    C    .002    STRAND BIAS    2    .
3    2    270    400    -    GG    .036    GOOD    6    .

file2.txt

Index   Chr Start   End Ref Alt Freq    Qual    Score
1    1    1    100    C    -    1    GOOD    10
2    2    20    200    A    C    .002    STRAND BIAS    2
3    2    270    400    -    GG    .036    GOOD    6

bash

for f in /home/cmccabe/Desktop/validate/*.txt; do
   bname=`basename $f`
   pref=${bname%%.txt}
   header=$(awk -F'\t' '{print NF, "fields detected in file and they are:" ORS $0; exit}') $f >> ${pref}_output  # detect header row in file and store in header and write to output
       if [[ $header == "10" ]]; then   # display results
          echo "file has expected number of fields"   # file is validated for headers
      else
          echo "file is missing header for:"  # missing header field ...in file not-validated
          echo "$header"
      fi  # close if.... else    
done >> ${pref}_output

desired output for file.txt

file has expected number of fields

desired output for file2.txt

file is missing header for:
Input

Upvotes: 1

Views: 3011

Answers (3)

David C. Rankin

Reputation: 84642

You can use awk if you like, but bash is more than capable of handling the first-line field comparison on its own. If you maintain an array of expected field names, you can split the first line into fields, compare the count against the expected number, and print the names of the missing fields whenever a file has fewer fields than expected.

The following is a short example that takes filenames as arguments (for a large number of files, read the filenames from stdin or use xargs as required). The script reads the first line of each file, separates the line into fields, checks the field count, and prints any missing fields in a short error message:

#!/bin/bash

declare -i header=10    ## header has 10 fields
## array of field names (can be read from 1st file)
fields=( "Index"
         "Chr"
         "Start"
         "End"
         "Ref"
         "Alt"
         "Freq"
         "Qual"
         "Score"
         "Input" )

for i in "$@"; do           ## for each file given as argument
    read -r line < "$i"     ## read first line from file into 'line'

    oldIFS="$IFS"           ## save current Internal Field Separator (IFS)
    IFS=$'\t'               ## set IFS to word-split on '\t'

    fldarray=( $line );     ## fill 'fldarray' with fields in line

    IFS="$oldIFS"           ## restore original IFS

    nfields=${#fldarray[@]} ## get number of fields in 'line'

    if (( nfields < header ))   ## test against header
    then
        printf "error: only '%d' fields in file '%s'\nmissing:" "$nfields" "$i"
        for j in "${fields[@]}" ## for each expected field
        do  ## check against those in line, if not present print
            [[ $line =~ $j ]] || printf " %s" "$j"
        done
        printf "\n\n"   ## tidy up with newlines
    fi
done

Example Input

$ cat dat/hdr.txt
Index   Chr     Start   End     Ref     Alt     Freq    Qual    Score   Input
1       1       1       100     C       -       1       GOOD    10      .
2       2       20      200     A       C       .002    STRAND BIAS     2       .
3       2       270     400     -       GG      .036    GOOD    6       .

$ cat dat/hdr2.txt
Index   Chr     Start   End     Ref     Alt     Freq    Qual    Score
1       1       1       100     C       -       1       GOOD    10
2       2       20      200     A       C       .002    STRAND BIAS     2
3       2       270     400     -       GG      .036    GOOD    6

$ cat dat/hdr3.txt
Index   Chr     Start   End     Alt     Freq    Qual    Score   Input
1       1       1       100     -       1       GOOD    10      .
2       2       20      200     C       .002    STRAND BIAS     2       .
3       2       270     400     GG      .036    GOOD    6       .

Example Use/Output

$ bash hdrfields.sh dat/hdr.txt dat/hdr2.txt dat/hdr3.txt
error: only '9' fields in file 'dat/hdr2.txt'
missing: Input

error: only '9' fields in file 'dat/hdr3.txt'
missing: Ref

Look things over. While awk can do many things bash cannot do on its own, bash is more than capable of parsing text like this.
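One caveat with the =~ test in the script above: it matches substrings, so a header named Reference would satisfy the check for Ref. Below is a minimal sketch of an exact-match variant. It assumes bash 4 or later (for associative arrays) and a tab-delimited header, and missing_fields is a hypothetical helper name, not part of the script above.

```shell
#!/bin/bash

## expected header names, same list as in the script above
fields=(Index Chr Start End Ref Alt Freq Qual Score Input)

## missing_fields FILE (hypothetical helper): print each expected
## name that is not a whole tab-separated field in FILE's first line
missing_fields() {
    local line name
    local -a got
    local -A seen=()

    ## read the first line; tolerate a missing trailing newline
    IFS= read -r line < "$1" || [[ -n $line ]] || return 1

    ## split on tabs into an array without pathname expansion
    IFS=$'\t' read -r -a got <<< "$line"

    for name in "${got[@]}"; do
        [[ -n $name ]] && seen[$name]=1     ## record names that are present
    done

    for name in "${fields[@]}"; do
        [[ -n ${seen[$name]} ]] || printf '%s\n' "$name"
    done
}
```

Called on a file whose header lacks the Input column, missing_fields prints Input. On a complete header it prints nothing, so the caller can test for empty output.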

Upvotes: 3

Matias Barrios

Reputation: 5054

This piece of code will do exactly what you are asking. Let me know if it works for you.

 for f in ./*.txt; do

      [[ $( head -n 1 "$f" | awk '{ print NF }' ) -eq 10 ]]  && echo "File $f has all the fields on its header" || echo "File $f is missing " $( echo "Index   Chr Start   End Ref Alt Freq    Qual    Score   Input $( head -n 1 "$f" )" | tr -s ' \t' '\n' | sort | uniq -c | awk '$1 == 1 {print $2}' );
 done

Output :

File ./file2.txt is missing  Input
File ./file.txt has all the fields on its header
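Note that the sort | uniq -c step reports every name that occurs exactly once across both lists, so an unexpected extra column in a file would also be printed as missing. A sketch of an alternative uses comm -23, which prints only the lines unique to its first input (here, the expected names). It assumes bash for process substitution, and missing_headers is a hypothetical helper name.

```shell
#!/bin/bash

# expected header names, one per line, sorted as comm requires
expected=$(printf '%s\n' Index Chr Start End Ref Alt Freq Qual Score Input | sort)

# missing_headers FILE (hypothetical helper): list the expected names
# that do not appear in the first line of FILE
missing_headers() {
    comm -23 <(printf '%s\n' "$expected") \
             <(head -n 1 "$1" | tr -s ' \t' '\n' | sort)
}
```

Inside the loop, something like missing=$(missing_headers "$f") followed by a test for an empty string would replace the field-count check.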

Upvotes: 1

James Brown

Reputation: 37464

Here is one in GNU awk (nextfile):

$ awk '
FNR==NR {
    for(n=1;n<=NF;n++)
        a[$n]
    nextfile
}
NF==(n-1) {
    print FILENAME " file has expected number of fields"
    nextfile
}
{
    for(i=1;i<=NF;i++)
        b[$i]
    print FILENAME " is missing header for: " 
    for(i in a)
    if(i in b==0)
        print i
    nextfile
}' file1 file1 file2
file1 file has expected number of fields
file2 is missing header for: 
Input

The first file passed to the script defines the expected headers (stored in a). The header of each following file is read into b and compared against a.
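In the script above, b is never cleared, so when three or more files are checked, a header seen in an earlier file still counts as present for a later one. The sketch below is one way to handle that: it resets the array for each file, splits on tabs, and avoids nextfile so it should also run on awks that lack it. The wrapper name check_headers and the array names want and have are assumptions for illustration.

```shell
#!/bin/bash

# check_headers REF FILE... (hypothetical wrapper): the first line of
# REF supplies the expected names; the first line of each FILE is
# compared against them
check_headers() {
    awk -F'\t' '
    FNR != 1 { next }                     # only header lines matter
    NR == 1 {                             # header of REF: keep names in order
        nwant = NF
        for (i = 1; i <= NF; i++) want[i] = $i
        next
    }
    {
        split("", have)                   # portable way to empty the array
        for (i = 1; i <= NF; i++) have[$i]
        missing = ""
        for (i = 1; i <= nwant; i++)
            if (!(want[i] in have)) missing = missing "\n" want[i]
        if (missing == "")
            print FILENAME " has all expected headers"
        else
            print FILENAME " is missing header for:" missing
    }' "$@"
}
```

As in the original, passing the reference file twice (check_headers file1 file1 file2) also reports on the reference file itself.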

Upvotes: 1
