Cody Glickman

Reputation: 524

Split large text file at the nth (10000th) occurrence of two forward slashes

I am attempting to split a large file in half (or into more pieces; the file sizes do not need to be equal) at the n-th occurrence of two forward slashes. I would like to keep the two forward slashes at the end of the first split file. I tried combining the approaches from "Search pattern containing forward slash using AWK" and "Awk: Splitting file on nth occurence of delimiter, wrong first split file", which gave me:

awk 'BEGIN{i=1}/^>/{cont++}cont==10000{i++;cont=1}{print > "file_"i".txt"}' Pfam-A.hmm

### Error Code Here
awk: syntax error at source line 1
 context is
    BEGIN{i=1}/^>/{cont++}cont==300{i++;cont=1}{print > >>>  "file_"i <<< ".txt"}
awk: illegal statement at source line 1

The large text file is formatted below:

Name: X
Description: This does something
Data: 
0
1
//
Name: Y
Description: This does something else
Data: 
2
3
4
5
//
Name: Z
Description: Z record description
Data: 
2
4
//
Name: Zeta
Description: This does something else too
Data: 
5
13
//

The desired output is two files containing Named records split by forward slashes.

File 1

Name: X
Description: This does something
Data: 
0
1
//
Name: Y
Description: This does something else
Data: 
2
3
4
5
//

File 2

Name: Z
Description: Z record description
Data: 
2
4
//
Name: Zeta
Description: This does something else too
Data: 
5
13
//

Upvotes: 1

Views: 115

Answers (3)

RARE Kpop Manifesto

Reputation: 2801

The pattern you're seeking can also be described as [\n][\/][\/][\n] (or the same thing without the square brackets).

If the file isn't too big, say under 800 MB, consider reading the whole file in as a single record and setting FS = "\n\/\/\n" .

Each field is then one Name record, so the cutoff point you seek falls right after field $10000.

Don't be afraid of large field numbers: mawk2 can easily handle 8-digit or even low-9-digit field numbers before it starts to sweat a bit.
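
One way to make that concrete is sketched below. It is not the exact FS setup described above: it gathers the file into one string and calls split() with the same "\n//\n" separator, which avoids depending on how a particular awk treats RS. The names sample.txt and part_N.txt and the variable n are made up for the illustration, and the sketch assumes the file ends with a // line.

```shell
# Hypothetical three-record sample in the question's layout.
printf '%s\n' 'Name: X' '0' '//' 'Name: Y' '2' '//' 'Name: Z' '4' '//' > sample.txt

awk -v n=2 '
    { text = text $0 "\n" }                      # hold the whole file in memory
    END {
        cnt = split(text, rec, "\n//\n")         # rec[k] is the k-th record body
        for (k = 1; k <= cnt; k++) {
            if (k == cnt && rec[k] == "") break  # empty piece after the final "//"
            out = "part_" (int((k - 1) / n) + 1) ".txt"
            if (prev != "" && out != prev) close(prev)
            printf "%s\n//\n", rec[k] > out      # put the delimiter back
            prev = out
        }
    }
' sample.txt
```

With n=2 and three records, this should leave two records in part_1.txt and one in part_2.txt. Because the whole file sits in memory, it only makes sense for inputs within the size range mentioned above.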

Upvotes: 0

Ed Morton

Reputation: 203502

print > "file_"i".txt" is using an unparenthesized expression on the right side of output redirection which is undefined behavior per POSIX, hence the syntax error. It needs to be print > ("file_"i".txt") instead.
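
As a minimal sketch of that fix applied to the one-liner from the question: the file name is parenthesized, the pattern matches the // delimiter lines, and the counter is checked before the next line is printed so that the closing // stays in the earlier file. The sample file, the n variable, and the choice of n=2 are assumptions made to keep the example small.

```shell
# Hypothetical four-record sample in the question's layout.
printf '%s\n' 'Name: X' 'Data:' '0' '//' 'Name: Y' 'Data:' '2' '//' \
              'Name: Z' 'Data:' '4' '//' 'Name: Zeta' 'Data:' '5' '//' > sample.txt

awk -v n=2 '
    BEGIN     { i = 1 }
    cont == n { close("file_" i ".txt"); i++; cont = 0 }  # chunk is full, move on
              { print > ("file_" i ".txt") }              # parenthesized target
    /^\/\/$/  { cont++ }                                  # count delimiter lines
' sample.txt
```

Here file_1.txt should end with the second // and file_2.txt should start at Name: Z. For the real data, n=10000 and Pfam-A.hmm would take the place of n=2 and sample.txt.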

FWIW I'd write your script as:

awk '
    BEGIN { out="file_" (++i) ".txt" }
    { print > out }
    /^\/\/$/ && (++cnt % 10000 == 0) { close(out); out="file_" (++i) ".txt" }
' Pfam-A.hmm

The above will work using any awk.

Upvotes: 0

Cyrus

Reputation: 88601

With GNU awk:

awk -v n=2 'BEGIN{RS=ORS="//\n"; FS=OFS="\n"; c=0} {$1=$1; if(NR%n==1){close(f); f="file_" ++c ".txt"}; print >f}' file

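$1=$1 forces awk to rebuild the current record using OFS.

NR%n is the record number modulo n, so NR%n==1 is true for records 1, n+1, 2n+1, and so on; each of those records starts a new output file.

I use close() so that the limit on the number of simultaneously open files is not reached.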
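
To see how the NR%n==1 test assigns records to files, here is a small sketch that prints the mapping instead of writing files. It uses five dummy one-line records with the default RS rather than the real data; the letters a to e and the mapping variable are placeholders.

```shell
# Five dummy records, one per line; print each record with its target file name.
mapping=$(printf '%s\n' a b c d e |
    awk -v n=2 '{ if (NR % n == 1) c++; print $0, "file_" c ".txt" }')
printf '%s\n' "$mapping"
# a file_1.txt
# b file_1.txt
# c file_2.txt
# d file_2.txt
# e file_3.txt
```

Note that with n=1 the test is never true, since NR % 1 is always 0; a condition such as (NR - 1) % n == 0 would cover that case too.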


See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR

Upvotes: 3
