Lilnet Cloud

Reputation: 81

Splitting text file and adding line count in header with awk in OSX

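I want to split my text file, which contains thousands of lines, into separate files at each line that starts with B, and add a header with the line count to each resulting file.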

I have tried the following code that allows me to split the file up, but the number of lines present in the file (as in NR-1 " 120") is cumulative and it is printed at the very end of the split file instead of at the start.

awk '/^B/{n++; print NR-1 " 120" > filename;close(filename);next}{filename = "part" n ".txt"; print >filename}'

In my attempts to print it as a header, I have used the following code, but the supposed header does not appear at all:

awk 'BEGIN{print NR-1 " 120" > filename}; /^B/{n++;close(filename);next};{filename = "part" n ".txt"; print >filename}' inputfile.txt

The code above produces the following error:

awk: null file name in print or getline source line number 1

My text file looks something like:

>L1212 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L1222 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L1232 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
B       *        -                     |1|
>L4212 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4312 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4412 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4512 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
B       *        -                     |2|
>L4212 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4312 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4412 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4512 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4312 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4412 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4512 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
B       *        -                     |3|

Update: As a workaround for using the script by @mklement0 without Mawk or GNU awk, I used grep in TextWrangler to change all lines starting with B to a single character, ~.

Upvotes: 4

Views: 515

Answers (1)

mklement0

Reputation: 439767

With GNU Awk or Mawk:

awk -v RS='\nB       \\*        -                     \\|[0-9]+\\|\n' 'NF {
  numLines = gsub("(^|\n)>", "\n") # replace line-initial ">" and count lines in block
  fname = "part" ++n               # determine next output filename
  printf "%s%s\n", numLines " 120", $0 > fname # output header + block
  close(fname)                               # close output file
}' file
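
To see how the gsub() call in the script above counts lines, here is a minimal sketch that applies the same call to a hypothetical three-line block held in a variable. The names block, count and result, as well as the shortened sample sequences, are made up for illustration; no multi-character RS is involved here.

```shell
# Apply the same gsub() call to a hypothetical 3-line block.
result=$(awk 'BEGIN {
  block = ">L1212 ATCT\n>L1222 ATCT\n>L1232 ATCT"  # made-up sample block
  count = gsub("(^|\n)>", "\n", block)             # replacements made = number of lines
  printf "%s%s\n", count " 120", block             # header, then the stripped lines
}')
printf '%s\n' "$result"
```

Because every line-initial > is replaced with a newline, the block ends up starting with a newline, which is why the printf format needs no separator between the header and the block.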

Note: Unless the last line in the input file is a separator line, the last output file will have a trailing empty line (the data-line count in the header will be correct, however) - the OP has confirmed this not to be a problem.

  • GNU Awk or Mawk are needed, because only they support multi-character regex-based RS (input-record separator) values - unlike the BSD awk that macOS comes with. It is possible to solve this problem differently, but it would be a little more cumbersome.

    • Both GNU Awk and Mawk can be installed on macOS via package manager Homebrew; with Homebrew installed, simply run brew install gawk or brew install mawk.
  • The approach breaks the input into blocks of lines, delimited by the B separator lines. Thus, each such block must fit into memory as a whole (presumably two copies at once, due to performing a string substitution).

  • Having the whole block of lines in memory before writing them to the output file is what allows counting the lines up front and adding that information to the header.

    • numLines = gsub("(^|\n)>", "\n") both removes the line-initial > chars. and determines the number of lines in the block, taking advantage of the fact that gsub() returns the number of replacements made.
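
For comparison, one way to solve the problem without a regex-based RS might look like the sketch below. It is not the method shown above: it reads the input line by line, buffers each block's lines in an array, and writes the header followed by the buffered lines when a B separator line (or the end of the input) is reached. The function name writeBlock, the variables buf and count, the sample file input.txt with shortened sequences, and the .txt suffix on the output files are all assumptions made for illustration. The sketch uses only standard awk features, so it should not depend on GNU Awk or Mawk.

```shell
# Hypothetical sample input, modeled on the data in the question.
cat > input.txt <<'EOF'
>L1212 ATCT
>L1222 ATCT
>L1232 ATCT
B       *        -                     |1|
>L4212 ATCT
>L4312 ATCT
B       *        -                     |2|
EOF

awk '
function writeBlock(    i, fname) {      # i and fname are local variables
  if (count == 0) return                 # nothing buffered: write no file
  fname = "part" (++n) ".txt"            # next output filename (assumed naming)
  print count " 120" > fname             # header: data-line count, then 120
  for (i = 1; i <= count; i++)
    print buf[i] > fname                 # buffered data lines follow the header
  close(fname)
  count = 0                              # start a fresh buffer for the next block
}
/^B/ { writeBlock(); next }              # a separator line ends the current block
{ sub(/^>/, ""); buf[++count] = $0 }     # strip the leading ">" and buffer the line
END { writeBlock() }                     # final block if no separator ends the input
' input.txt
```

As with the RS-based approach, each block is held in memory before it is written, which is what makes it possible to emit the count first.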

Upvotes: 1
