Reputation: 81
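I want to do the following with my text file, which contains thousands of lines: split it at every line that starts with B (without including that line in the output), write a header of the form <number of lines> " 120" at the top of each resulting file, and remove the leading > from every data line.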
I have tried the following code, which splits the file, but the line count (printed as NR-1 " 120") is cumulative and is written at the very end of each split file instead of at the start.
awk '/^B/{n++; print NR-1 " 120" > filename;close(filename);next}{filename = "part" n ".txt"; print >filename}'
In my attempts to print it as a header, I used the following code, but the intended header does not appear at all.
awk 'BEGIN{print NR-1 " 120" > filename}; /^B/{n++;close(filename);next};{filename = "part" n ".txt"; print >filename}' inputfile.txt
The code above also produces the following error:
awk: null file name in print or getline
source line number 1
My text file looks something like:
>L1212 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L1222 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L1232 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
B * - |1|
>L4212 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4312 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4412 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4512 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
B * - |2|
>L4212 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4312 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4412 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4512 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4312 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4412 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4512 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
B * - |3|
Update:
As a workaround for using the script by @mklement0 without Mawk or GNU awk, I used grep in TextWrangler to change all lines starting with B to a single character, ~.
Upvotes: 4
Views: 515
Reputation: 439767
With GNU Awk or Mawk:
awk -v RS='\nB \\* - \\|[0-9]+\\|\n' 'NF {
  numLines = gsub("(^|\n)>", "\n")               # replace line-initial ">" and count lines in block
  fname = "part" ++n                             # determine next output filename
  printf "%s%s\n", numLines " 120", $0 > fname   # output header + block
  close(fname)                                   # close output file
}' file
Note: Unless the last line in the input file is a separator line, the last output file will have a trailing empty line (the data-line count in the header will be correct, however) - the OP has confirmed this not to be a problem.
GNU Awk or Mawk are needed because only they support multi-character, regex-based RS (input-record separator) values, unlike the BSD awk that macOS comes with. It is possible to solve this problem differently, but it would be a little more cumbersome.
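For illustration, here is a minimal sketch of one such alternative that avoids a regex-based RS and should therefore also run with a POSIX awk such as the BSD one. It is not the method used above: it buffers each block in a variable and writes it out when a separator line (or the end of the input) is reached. The sample file, the shortened sequences, and the names write_block, buf, and count are assumptions made for this example.

```shell
# Hypothetical sample input, modeled on the data in the question.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
cat > sample.txt <<'EOF'
>L1212 ATCT
>L1222 ATCT
>L1232 ATCT
B * - |1|
>L4212 ATCT
>L4312 ATCT
B * - |2|
EOF

awk '
function write_block(   fname) {     # write the buffered block, if there is one
    if (count == 0) return
    fname = "part" ++n               # next output filename: part1, part2, ...
    printf "%d 120\n%s", count, buf > fname   # header first, then the buffered lines
    close(fname)
    buf = ""; count = 0              # reset the buffer for the next block
}
/^B/ { write_block(); next }         # separator line: emit the block, skip the line itself
{
    sub(/^>/, "")                    # drop the line-initial ">"
    buf = buf $0 "\n"                # append the line to the buffer
    count++
}
END { write_block() }                # emit a final block that has no trailing separator
' sample.txt

cat part1 part2
```

As with the RS-based command, each block is held in memory before it is written, which is what makes it possible to put the line count in the header.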
GNU Awk or Mawk can be installed on macOS with Homebrew: brew install gawk or brew install mawk.
The approach breaks the input into blocks of lines, delimited by the B separator lines. Thus, each such block must fit into memory as a whole (presumably two copies at once, due to the string substitution).
Having the whole block of lines in memory before writing them to the output file is what allows counting the lines up front and adding that information to the header.
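To make the up-front counting concrete, the following small sketch applies the same substitution to a hard-coded block instead of a real input record. The variable block and its three shortened lines are made up for this example; the gsub() call and the printf format are the ones used in the command above.

```shell
result=$(awk 'BEGIN {
    block = ">L1212 ATCT\n>L1222 ATCT\n>L1232 ATCT"   # hypothetical three-line block
    numLines = gsub("(^|\n)>", "\n", block)            # one replacement per line
    printf "%s%s\n", numLines " 120", block            # header, then the rewritten block
}')
printf '%s\n' "$result"
```

Because the first line's > is also replaced with a newline, the block starts with a newline after the substitution, which is why the header and the block can be printed back to back without an explicit separator.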
numLines = gsub("(^|\n)>", "\n") both removes the line-initial > characters and determines the number of lines in the block, taking advantage of the fact that gsub() returns the number of replacements made.
Upvotes: 1