Reputation: 575

next command in gawk not producing expected result

I'm trying to skip the whole first section of a bunch of tab-delimited text files. (I converted to comma-delimited for the sample data.) I just can't seem to figure out why this doesn't work:

CODE

gawk '
  /[^Country Of Sale]/ {next}
  /^Cloud Total/ {nextfile}
  FNR > 1 {$0 =  FILENAME OFS $0; print}
' OFS='\t' /path/to/files/*.txt > path/to/new_file.txt

DATA

"Start Date","End Date","UPC" "4/2/17","5/6/17","SKIP THIS LINE"
"4/2/17","5/6/17","SKIP THIS LINE" "4/2/17","5/6/17","SKIP THIS LINE"
"4/2/17","5/6/17","SKIP THIS LINE" "4/2/17","5/6/17","SKIP THIS LINE"
"Row Count","447","SKIP THIS LINE" 
"Country Of Sale","Total","Total  Units1","Total Units2","Total C_F","SPCU","PCUT","CPS","USPS","Total  Share","EffSUBS","ActSUBS"
"AU","0","139851331","139851331","195833.36","0.001400297","1170","1.36","","1.36","91704.63","99430"
"Cloud Total","1.36" "Sales Total","243.18" "Total Amount","244.54"

EXPECTED OUTPUT

"Country Of Sale","Total","Total  Units1","Total Units2","Total C_F","SPCU","PCUT","CPS","USPS","Total  Share","EffSUBS","ActSUBS"
"AU","0","139851331","139851331","195833.36","0.001400297","1170","1.36","","1.36","91704.63","99430"

Also, I'd like to make the "Country Of Sale" line the header for all the files. But NR & FNR start counting at the beginning. How can I do that, given that "Country Of Sale" appears in a different line number in each file?

Thanks for any help!

Upvotes: 1

Answers (3)

Steve

Reputation: 575

Thanks to @EdMorton & @JonathanLeffler for giving me the necessary clues. What ended up working was using /^Country Of Sale/{next} & /^Cloud Total/ {nextfile}. Next, I'll go figure out exactly *why* this worked!

Upvotes: -1

Jonathan Leffler

Reputation: 754470

As I noted in the comments, /[^Country Of Sale]/ probably isn't doing what you think it should. Hint: one of the repeated blanks is superfluous. (It just so happens that the blank is the only repeated character in that negated character class.)

What it actually does is look for any character other than those in [ COSaeflnortuy] (the square brackets are metacharacters) and jump to the next line if it finds one. For example, if the line contains a double quote or a comma, it will jump to the next line of input (because neither double quote nor comma is listed in the square brackets).
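To see that behaviour in isolation, here is a small sketch. The four input lines are made up for illustration, and plain awk is assumed to treat the bracket expression the same way gawk does. Only lines built entirely from the listed characters survive the {next}:

```shell
# Feed four illustrative lines through the original pattern.
# A line is skipped as soon as it contains any character outside the set.
matched=$(printf '%s\n' 'Country Of Sale' '"Country Of Sale"' 'AU,0' 'a really true tale' |
  awk '/[^Country Of Sale]/ { print "skip: " $0; next } { print "keep: " $0 }')
printf '%s\n' "$matched"
# keep: Country Of Sale
# skip: "Country Of Sale"
# skip: AU,0
# keep: a really true tale
```

Note that the quoted header line is skipped too, because the double quote is not in the set.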

Note that in your CSV data, "Cloud Total" does not start the line with C; it starts with a double quote. Unfortunately, your regex searching for it insists that the C must be the first character.
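A quick way to check the anchoring problem, as a sketch on a single made-up totals line (the variable names are only for illustration):

```shell
line='"Cloud Total","1.36"'
# ^Cloud insists that C is the first character, but the line starts with a quote.
anchored=$(printf '%s\n' "$line" | awk '/^Cloud Total/ { n++ } END { print n + 0 }')
# Including the leading quote in the pattern lets the anchor match.
quoted=$(printf '%s\n' "$line" | awk '/^"Cloud Total"/ { n++ } END { print n + 0 }')
echo "anchored=$anchored quoted=$quoted"
# anchored=0 quoted=1
```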

I think you need something like:

gawk 'FNR==1,/Country Of Sale/ { next }
      /Cloud Total/ { nextfile }
      { print }' data

That lists just the AU line in the given data (and if you list the same file 3 times on a single command line, you get 3 lines starting with AU, so it works OK across files, in part because of the range FNR==1,/…/).

You should be able to take it from there. You can make the patterns more restrictive (/^"Country Of Sale",/ etc.) if you wish. You can use { print FILENAME OFS $0 } to print the line prefixed by the file name and an output field separator (a tab in your command line).
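Putting those two suggestions together might look roughly like the sketch below. The temporary directory, the file names a.txt and b.txt, and the shortened sample rows are all made up for the demonstration. It uses gawk if it is installed and otherwise falls back to awk, on the assumption that the local awk supports nextfile:

```shell
# Build two small sample files in a scratch directory (hypothetical names and data).
tmp=$(mktemp -d)
for name in a.txt b.txt; do
  printf '%s\n' \
    '"Start Date","End Date","UPC"' \
    '"Row Count","447","SKIP THIS LINE"' \
    '"Country Of Sale","Total"' \
    '"AU","0"' \
    '"Cloud Total","1.36"' \
    '"Sales Total","243.18"' > "$tmp/$name"
done

awk_bin=$(command -v gawk || command -v awk)
prefixed=$(cd "$tmp" && "$awk_bin" -v OFS='\t' '
  FNR == 1, /^"Country Of Sale",/ { next }   # skip the preamble through the header
  /^"Cloud Total",/ { nextfile }             # stop reading at the totals line
  { print FILENAME OFS $0 }                  # prefix each data row with the file name
' a.txt b.txt)
printf '%s\n' "$prefixed"
rm -rf "$tmp"
```

With that input, each file should contribute one tab-prefixed "AU" row.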

This, and @Ed's suggestion too, both give all of the lines of data, instead of just what's between "Country Of Sale" and "Cloud Total".

This is what I get (on a Mac running macOS Sierra 10.12.6, using a home-built GNU Awk 4.1.3, API: 1.1):

$ cat data
"Start Date","End Date","UPC" "4/2/17","5/6/17","SKIP THIS LINE"
"4/2/17","5/6/17","SKIP THIS LINE" "4/2/17","5/6/17","SKIP THIS LINE"
"4/2/17","5/6/17","SKIP THIS LINE" "4/2/17","5/6/17","SKIP THIS LINE"
"Row Count","447","SKIP THIS LINE" 
"Country Of Sale","Total","Total  Units1","Total Units2","Total C_F","SPCU","PCUT","CPS","USPS","Total  Share","EffSUBS","ActSUBS"
"AU","0","139851331","139851331","195833.36","0.001400297","1170","1.36","","1.36","91704.63","99430"
"Cloud Total","1.36" "Sales Total","243.18" "Total Amount","244.54"
$ gawk 'FNR==1,/Country Of Sale/{next} /Cloud Total/ {nextfile} { print }' data data data
"AU","0","139851331","139851331","195833.36","0.001400297","1170","1.36","","1.36","91704.63","99430"
"AU","0","139851331","139851331","195833.36","0.001400297","1170","1.36","","1.36","91704.63","99430"
"AU","0","139851331","139851331","195833.36","0.001400297","1170","1.36","","1.36","91704.63","99430"
$

Given that I gave it the file to process 3 times, that's what I'd expect and appears to be what you'd want.

If you want the "Country Of Sale" heading line in the output, that can be added easily enough:

gawk 'FNR==1,/Country Of Sale/ { if ($0 ~ /Country Of Sale/) print; next }
      /Cloud Total/ { nextfile }
      { print }' data

And if you want the header only once even though it appears in many files, then:

gawk 'FNR==1,/Country Of Sale/ { if ($0 ~ /Country Of Sale/ && hdr_count++ == 0) print; next }
      /Cloud Total/ { nextfile }
      { print }' data

Upvotes: 2

Ed Morton

Reputation: 203985

[...] is a bracket expression which includes a list, set or range of characters. It does NOT contain a string or a negation of a string.

[^Country Of Sale] = [^ aCeflnoOrStuy]

when you probably meant:

!/Country Of Sale/

which still isn't what you actually need. Try this:

gawk '
  BEGIN { FS=OFS="\t" }
  /Country Of Sale/ { f=1 }
  /Cloud Total/ { f=0; nextfile }
  f { print FILENAME, $0 }
' RAW/iTunes/iTunesMatch/*.txt > munched/iTunesMatch_TEST.txt

Look:

$ cat file
"Start Date","End Date","UPC" "4/2/17","5/6/17","SKIP THIS LINE"
"4/2/17","5/6/17","SKIP THIS LINE" "4/2/17","5/6/17","SKIP THIS LINE"
"4/2/17","5/6/17","SKIP THIS LINE" "4/2/17","5/6/17","SKIP THIS LINE"
"Row Count","447","SKIP THIS LINE"
"Country Of Sale","Total","Total  Units1","Total Units2","Total C_F","SPCU","PCUT","CPS","USPS","Total  Share","EffSUBS","ActSUBS"
"AU","0","139851331","139851331","195833.36","0.001400297","1170","1.36","","1.36","91704.63","99430"
"Cloud Total","1.36" "Sales Total","243.18" "Total Amount","244.54"

$ gawk '
   BEGIN { FS=OFS="\t" }
   /Country Of Sale/ { f=1 }
   /Cloud Total/ { f=0; nextfile }
   f { print FILENAME, $0 }
' file
file    "Country Of Sale","Total","Total  Units1","Total Units2","Total C_F","SPCU","PCUT","CPS","USPS","Total  Share","EffSUBS","ActSUBS"
file    "AU","0","139851331","139851331","195833.36","0.001400297","1170","1.36","","1.36","91704.63","99430"

If you have multiple input files and only want the "Country Of Sale" line to appear once, then one approach would be:

$ gawk '
   BEGIN { FS=OFS="\t" }
   /Country Of Sale/ { f=1; if (NR==FNR) print FILENAME, $0; next}
   /Cloud Total/ { f=0; nextfile }
   f { print FILENAME, $0 }
' file file file
file    "Country Of Sale","Total","Total  Units1","Total Units2","Total C_F","SPCU","PCUT","CPS","USPS","Total  Share","EffSUBS","ActSUBS"
file    "AU","0","139851331","139851331","195833.36","0.001400297","1170","1.36","","1.36","91704.63","99430"
file    "AU","0","139851331","139851331","195833.36","0.001400297","1170","1.36","","1.36","91704.63","99430"
file    "AU","0","139851331","139851331","195833.36","0.001400297","1170","1.36","","1.36","91704.63","99430"
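One caveat with the NR==FNR test is that it only fires while awk is reading the first file, so the header is lost if that file happens to have no "Country Of Sale" line. A possible variant, sketched below with made-up file names and data, tracks the header with a counter instead (seen is a hypothetical variable name). It uses gawk if installed and otherwise awk, assuming nextfile is supported:

```shell
# Scratch files: the first has no header section at all (hypothetical data).
tmp=$(mktemp -d)
printf '%s\n' '"junk"' '"Cloud Total","0"' > "$tmp/empty.txt"
printf '%s\n' '"junk"' '"Country Of Sale","Total"' '"AU","0"' '"Cloud Total","1.36"' > "$tmp/one.txt"
cp "$tmp/one.txt" "$tmp/two.txt"

awk_bin=$(command -v gawk || command -v awk)
once=$(cd "$tmp" && "$awk_bin" '
  BEGIN { FS = OFS = "\t" }
  /Country Of Sale/ { f = 1; if (!seen++) print FILENAME, $0; next }  # header only the first time
  /Cloud Total/     { f = 0; nextfile }                               # totals end the section
  f                 { print FILENAME, $0 }                            # rows inside the section
' empty.txt one.txt two.txt)
printf '%s\n' "$once"
rm -rf "$tmp"
```

Here the header should come from one.txt, the first file that actually contains it, followed by one "AU" row each from one.txt and two.txt.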

Upvotes: 2
