justaguy

Reputation: 3022

Print lines with a blank field in a file using awk

In the input file below, I am using awk to print the lines where $5 is blank. The awk command runs and produces output, but it prints the entire input file, not just the lines where $5 is blank. My awk version is GNU 4.0.1. Thank you :)

input

chr6   32945523   32945792     chr6:32945523-32945792     BRD2-351|gc=50
chr6   32945892   32946175     chr6:32945892-32946175     BRD2-352|gc=53.5
chr6   32946856   32946981     chr6:32946856-32946981
chr6   32947594   32947919     chr6:32947594-32947919     BRD2-354|gc=51.2

desired result

chr6   32946856   32946981     chr6:32946856-32946981

awk

cat input | awk 'BEGIN {FS="\t"} $5=="" {print}'

current output

cat input | awk 'BEGIN {FS="\t"} $5=="" {print}'
chr6   32945523   32945792     chr6:32945523-32945792     BRD2-351|gc=50
chr6   32945892   32946175     chr6:32945892-32946175     BRD2-352|gc=53.5
chr6   32946856   32946981     chr6:32946856-32946981
chr6   32947594   32947919     chr6:32947594-32947919     BRD2-354|gc=51.2
chr6   32948108   32948251     chr6:32948108-32948251     BRD2-355|gc=43

Edit: the awk below works, but I'm not sure why the original did not:

awk '$5==""' input

Upvotes: 0

Views: 697

Answers (1)

Adam Katz

Reputation: 16118

I'm not sure why you're specifying a field separator (FS) of tab (\t). That should only be necessary if you have a TSV file (tab-separated values, similar to CSV). If you do indeed have a TSV file, meaning there are spaces in some values and/or two consecutive tabs indicate an empty field in the middle, you need awk 'BEGIN {FS="\t"} …' or the shorter awk -F '\t' '…'.
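One way to tell which case applies is to count the fields awk sees under each separator. The following is a minimal sketch that assumes the input is space-aligned with no tabs, as the question's sample appears to be (the variable names are only illustrative). If that assumption holds, -F '\t' leaves the whole line in $1, so $5 is empty on every line, which would explain why the original command printed the entire file.

```shell
# Assumption: a space-separated line (no tabs), copied from the question's input.
line='chr6   32945523   32945792     chr6:32945523-32945792     BRD2-351|gc=50'

# Default separator: runs of blanks split the line into five fields.
default_nf=$(printf '%s\n' "$line" | awk '{ print NF }')

# Tab separator: there are no tabs, so the whole line is a single field.
tab_nf=$(printf '%s\n' "$line" | awk -F '\t' '{ print NF }')

# Count how many lines satisfy $5 == "" when splitting on tabs.
tab_match=$(printf '%s\n' "$line" | awk -F '\t' '$5 == "" { n++ } END { print n + 0 }')

echo "default NF=$default_nf, tab NF=$tab_nf, tab-FS matches=$tab_match"
```

If the two NF values differ like this on your real file, it does not contain the tabs that FS="\t" expects.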

Try this:

awk 'NF < 5' input

If you have a TSV format that includes some empty fields, try this:

awk -F '\t' '$5 == ""' input

Here's a more reliable test given HTML's inability to represent tabs:

sample() {
  # printf expands \t in every POSIX shell; echo only does so in some shells.
  printf 'chr6\t32945523\t32945792\tchr6:32945523-32945792\tBRD2-351|gc=50\n'
  printf 'chr6\t32945892\t32946175\tchr6:32945892-32946175\tBRD2-352|gc=53.5\n'
  printf 'chr6\t32946856\t32946981\tchr6:32946856-32946981\n'
  printf 'chr6\t32947594\t32947919\tchr6:32947594-32947919\tBRD2-354|gc=51.2\n'
  printf 'chr6\t32947594\t32947919\tchr6:32947594-32947919\t\ttest\n'
  printf 'chr6\t32947594\t\tchr6:32947594-32947919\tBRD2-354|gc=51.2\n'
}

echo "unfiltered"
sample

printf '\n%s\n' "testing awk 'NF < 5'"
sample | awk 'NF < 5'

printf '\n%s\n' "testing awk -F '\\t' '\$5 == \"\"'"
sample | awk -F '\t' '$5 == ""'

The last two lines of sample() illustrate the difference between awk's default (FS=" ", which splits on runs of spaces, tabs, and newlines and ignores leading and trailing whitespace) and FS="\t".

With the default, you'll get that short line plus the final line, since the two tabs around the empty TSV field 3 are collapsed into one separator (TSV field 5 becomes awk field 4). The "test" line collapses TSV field 6 into awk field 5, so the default misses it.

The tab separator also matches that short line. It counts fields the way a TSV does: the "test" line has an empty fifth field ("test" is its sixth), so it matches, while the final line's missing third field is kept as an empty field rather than collapsed, so its "BRD2" value is correctly counted as the fifth field and the line is not printed.
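To see the two behaviours side by side, this small sketch prints the field count and the fifth field for the last two lines of sample() under each separator. The helper name tricky and the two variables are only illustrative, and the lines are built with printf so the tabs are real.

```shell
# The two edge-case lines from sample(): an empty fifth field, then an empty third field.
tricky() {
  printf 'chr6\t32947594\t32947919\tchr6:32947594-32947919\t\ttest\n'
  printf 'chr6\t32947594\t\tchr6:32947594-32947919\tBRD2-354|gc=51.2\n'
}

# Default FS: consecutive tabs collapse, so empty fields disappear.
default_view=$(tricky | awk '{ printf "NF=%d $5=[%s]\n", NF, $5 }')

# Tab FS: every tab is a separator, so empty fields are preserved.
tab_view=$(tricky | awk -F '\t' '{ printf "NF=%d $5=[%s]\n", NF, $5 }')

echo "default FS:"
echo "$default_view"
echo "tab FS:"
echo "$tab_view"
```

Under the default, the "test" line reports five fields with "test" as $5 and the final line reports four; with -F '\t' the counts become six and five, and only the "test" line has an empty $5.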

unfiltered
chr6    32945523    32945792    chr6:32945523-32945792  BRD2-351|gc=50
chr6    32945892    32946175    chr6:32945892-32946175  BRD2-352|gc=53.5
chr6    32946856    32946981    chr6:32946856-32946981
chr6    32947594    32947919    chr6:32947594-32947919  BRD2-354|gc=51.2
chr6    32947594    32947919    chr6:32947594-32947919      test
chr6    32947594        chr6:32947594-32947919  BRD2-354|gc=51.2

testing awk 'NF < 5'
chr6    32946856    32946981    chr6:32946856-32946981
chr6    32947594        chr6:32947594-32947919  BRD2-354|gc=51.2

testing awk -F '\t' '$5 == ""'
chr6    32946856    32946981    chr6:32946856-32946981
chr6    32947594    32947919    chr6:32947594-32947919      test

Upvotes: 1
