Reputation: 524
I am attempting to split a large file into two or more parts (the parts do not need to be equal in size) at the n-th occurrence of a line containing two forward slashes. I would like to keep the two forward slashes at the end of the first split file. I tried combining the answers to "Search pattern containing forward slash using AWK" and "Awk: Splitting file on nth occurence of delimiter, wrong first split file", and ended up with:
awk 'BEGIN{i=1}/^>/{cont++}cont==10000{i++;cont=1}{print > "file_"i".txt"}' Pfam-A.hmm
### Error Code Here
awk: syntax error at source line 1
context is
BEGIN{i=1}/^>/{cont++}cont==300{i++;cont=1}{print > >>> "file_"i <<< ".txt"}
awk: illegal statement at source line 1
The large text file is formatted as follows:
Name: X
Description: This does something
Data:
0
1
//
Name: Y
Description: This does something else
Data:
2
3
4
5
//
Name: Z
Description: Z record description
Data:
2
4
//
Name: Zeta
Description: This does something else too
Data:
5
13
//
The desired output is two files containing Named records split by forward slashes.
File 1
Name: X
Description: This does something
Data:
0
1
//
Name: Y
Description: This does something else
Data:
2
3
4
5
//
File 2
Name: Z
Description: Z record description
Data:
2
4
//
Name: Zeta
Description: This does something else too
Data:
5
13
//
Upvotes: 1
Views: 115
Reputation: 2801
The pattern you're seeking can also be described as [\n][\/][\/][\n] (or the same without the square brackets).
If the file isn't too big, say within 800MB, consider reading the whole file in as a single record and using FS = "\n\/\/\n". Field number $10000 will then be the cutoff point you seek.
Don't be afraid of large field numbers: mawk2 can easily handle 8-digit or even low-9-digit field numbers before it starts to sweat a bit.
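A minimal sketch of that whole-file idea, not a tuned implementation. It assumes the input contains no \001 byte (used here as a record separator that never matches) and that the file ends with a newline after the last //. The file names sample.txt, part_1.txt and part_2.txt and the cutoff n=2 are made up for illustration, and very large inputs may hit memory or record-size limits in some awks.

```shell
# Tiny stand-in for the real input: four records, each terminated by a "//" line.
printf '%s\n' 'Name: X' '0' '//' 'Name: Y' '2' '//' 'Name: Z' '4' '//' 'Name: Zeta' '13' '//' > sample.txt

# RS is a byte assumed to be absent from the data, so the whole file is one record.
# FS then splits that record at every "//" line, making field k the k-th record.
awk -v n=2 '
BEGIN { RS = "\001"; FS = "\n//\n" }
{
    for (k = 1; k <= NF; k++) {
        if ($k == "") continue                        # empty field after the final "//"
        out = (k <= n) ? "part_1.txt" : "part_2.txt"  # first n records, then the rest
        printf "%s\n//\n", $k > out                   # put the terminator back
    }
}' sample.txt
```

With the four-record sample and n=2, the first two records should land in part_1.txt and the remaining two in part_2.txt, each still followed by its // line.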
Upvotes: 0
Reputation: 203502
print > "file_"i".txt" uses an unparenthesized expression on the right side of output redirection, which is undefined behavior per POSIX, hence the syntax error. It needs to be print > ("file_"i".txt") instead.
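As a rough illustration of the parenthesized form, here is the one-liner from the question with only that change plus two assumptions made for the demo: the pattern is /^\/\/$/ (a line consisting of //) rather than /^>/, and the threshold is 2 instead of 10000. The names demo.txt and demo_N.txt are hypothetical.

```shell
# Small stand-in input: three records, each ending in a "//" line.
printf '%s\n' 'Name: X' '//' 'Name: Y' '//' 'Name: Z' '//' > demo.txt

# Print first, then count, so each "//" stays at the end of the file it closes.
# The redirection target is parenthesized, which every POSIX awk must accept.
awk 'BEGIN{i=1} {print > ("demo_" i ".txt")} /^\/\/$/{cont++} cont==2{i++; cont=0}' demo.txt
```

This keeps every output file open until awk exits, so for a large number of pieces a close() call on the finished file is worth adding.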
FWIW I'd write your script as:
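awk -v n=10000 '
BEGIN { out = "file_" (++i) ".txt" }
{ print > out }
# after every n-th "//" line, close the current file and start the next one
/^\/\/$/ && (++cnt % n == 0) { close(out); out = "file_" (++i) ".txt" }
' Pfam-A.hmm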
The above will work using any awk.
Upvotes: 0
Reputation: 88601
With GNU awk:
awk -v n=2 'BEGIN{RS=ORS="//\n"; FS=OFS="\n"; c=0} {$1=$1; if(NR%n==1){close(f); f="file_" ++c ".txt"}; print >f}' file
$1=$1 forces awk to rebuild the current record. NR%n is a modulo operation. I use close() to prevent the maximum possible number of open files from being reached.
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
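One edge case in the command above: NR%n==1 is never true when n=1, so no output file name is ever assigned. A possible variant, sketched under the assumption that the awk in use accepts a multi-character RS (GNU awk does), tests (NR-1)%n==0 instead. The names recs.txt and rec_N.txt are invented for the example.

```shell
# Stand-in input: three records, each terminated by a "//" line.
printf '%s\n' 'Name: X' '0' '//' 'Name: Y' '2' '//' 'Name: Z' '4' '//' > recs.txt

# (NR - 1) % n == 0 holds for records 1, n+1, 2n+1, ... and also when n is 1.
awk -v n=1 '
BEGIN { RS = ORS = "//\n" }                  # read and write whole "//"-terminated records
(NR - 1) % n == 0 {
    if (f != "") close(f)                    # release the previous output file
    f = "rec_" (++c) ".txt"
}
{ print > f }
' recs.txt
```

With n=1 each record should end up in its own file; larger values of n group that many records per file.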
Upvotes: 3