user19619903

Reputation: 151

Split a file using a pattern as a delimiter

I have a 5000-line file consisting of blocks of lines, with an END string between blocks, as follows:

ATOM 1
ATOM 3
ATOM 25
END 
ATOM 2
ATOM 36
ATOM 22
ATOM 12 
END 
ATOM 1
ATOM 87
END 

I want to find a way to split the file into several files, each containing a single block of lines before the END string. The first file should look as follows:

ATOM 1
ATOM 3
ATOM 25

The second file should contain

ATOM 2
ATOM 36
ATOM 22
ATOM 12 

And so on. I have thought of using something like awk '/END/{flag=1; next} /END/{flag=0} flag' file to take the blocks between the END strings. This, however, does not work for my first block, as the END string comes only after the block. More importantly, it cannot keep count of how many times it has found the string END, which is needed to write each block to its own file. Is there a way I can use the string END to split my file into several files, each containing a block that ends at the string END?

Upvotes: 1

Views: 3100

Answers (6)

dawg

Reputation: 104102

A few other ways.

Perl:

perl -0777 -lnE 'while (/([\s\S]*?)^END\s*/gm) {
    $cnt++;
    open(FH, ">file_${cnt}.txt");
    print FH $1;
    close (FH);
}' file 

Ruby:

ruby -e 'cnt=1; s=$<.read.scan(/([\s\S]*?)^END\s*/m) { |b|
    File.write("file_#{cnt}.txt", b.join(""))
    cnt+=1
}' file 

Any awk:

awk 'BEGIN { i=1; fn=sprintf("file_%s.txt", i) }
    $1=="END" { close(fn); fn=sprintf("file_%s.txt", ++i); next }
    {print > fn }
' file 

Or you can use sed and process substitution with Bash. (Note: this only works if the file is properly terminated with a final newline.)

while IFS= read -r -d $'\3' block; do
    (( i++ ))
    printf "%s" "$block" > "file_${i}.txt"
done < <(sed '/^END[[:space:]]*$/N; s/^END[[:space:]]*/\x3/' file)

Any of these results in:

head file_*.txt
==> file_1.txt <==
ATOM 1
ATOM 3
ATOM 25

==> file_2.txt <==
ATOM 2
ATOM 36
ATOM 22
ATOM 12 

==> file_3.txt <==
ATOM 1
ATOM 87

# ^ Note final file has proper \n termination

Upvotes: 0

potong

Reputation: 58578

This might work for you (GNU csplit):

csplit -qz -f file -b '%04d.txt' --suppress-matched file '/END/' '{*}'

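-q and -z: be quiet and elide any empty files.

-f file and -b '%04d.txt': prefix the output files with file and suffix them with four digits plus .txt.

--suppress-matched: suppress the matching lines, i.e. the END lines.

'{*}': repeat the pattern until the end of the file.

If you do not mind the output files getting the default names xx00, xx01, and so on, use: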

csplit -qz --sup file '/END/' '{*}'
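
To check what the first command produces, here is a small self-contained run. It is only a sketch: the split_demo directory and the sample.txt name are made up for the demo, and it assumes a GNU csplit recent enough to support --suppress-matched.

```shell
# Scratch directory for the demo (hypothetical name).
mkdir -p split_demo

# Recreate the sample input from the question.
printf '%s\n' 'ATOM 1' 'ATOM 3' 'ATOM 25' 'END' \
              'ATOM 2' 'ATOM 36' 'ATOM 22' 'ATOM 12' 'END' \
              'ATOM 1' 'ATOM 87' 'END' > split_demo/sample.txt

# Split at every END line, drop the END lines, and skip empty pieces.
csplit -qz -f split_demo/file -b '%04d.txt' --suppress-matched \
    split_demo/sample.txt '/END/' '{*}'

# Print each piece preceded by its file name.
head split_demo/file0*.txt
```

Note that csplit numbers its pieces from 0, so the first block lands in file0000.txt.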

Upvotes: 1

karakfa

Reputation: 67567

$ awk '/END/{c++; next} {print > ("file."(c+1)".txt")}' file



==> file.1.txt <==
ATOM 1
ATOM 3
ATOM 25

==> file.2.txt <==
ATOM 2
ATOM 36
ATOM 22
ATOM 12

==> file.3.txt <==
ATOM 1
ATOM 87

If you have too many sections, you may eventually run into a "too many open files" error, so it is better to close each file when you are done with it.

$ awk 'BEGIN {f="file."(++c)".txt"}
       /END/ {close(f); f="file."(++c)".txt"; next}
             {print > f}' file

Upvotes: 1

Ed Morton

Reputation: 204638

Using any awk:

$ awk -v cnt=1 '
    /END/ { cnt++; next }
    cnt != prev { close(out); out="foo" cnt ".txt"; prev=cnt }
    { print > out }
' file

$ head foo*.txt
==> foo1.txt <==
ATOM 1
ATOM 3
ATOM 25

==> foo2.txt <==
ATOM 2
ATOM 36
ATOM 22
ATOM 12

==> foo3.txt <==
ATOM 1
ATOM 87

Upvotes: 3

Dave Pritlove

Reputation: 2687

awk's record separator (RS) can be reset to read blocks separated by the word "END", and each block can be printed to a file with a numerically incremented filename as follows:

awk 'BEGIN{RS="END";ORS="";i=1;} {print > "part"i".file"; i++}' file.txt

The output record separator ORS has been set to an empty string to prevent additional new lines at the end of the file. Files after the first part still have a leading empty line that could be removed if essential. It also creates an additional empty file that can be ignored for this 'quick and dirty' solution.

An incremented counter i is used to form sequential file names.

Output from the above command, run on a file copy of your input:

> ls part*
part1.file  part2.file  part3.file  part4.file
> cat part1.file
ATOM 1
ATOM 3
ATOM 25
> cat part2.file
 
ATOM 2
ATOM 36
ATOM 22
ATOM 12 

(part4.file is empty)

Possible problem: some versions of awk apparently do not accept an unparenthesized concatenation as the filename in a print redirection. If an error occurs here, the filename can be built first, as in this slightly longer version:

awk 'BEGIN{RS="END";ORS="";i=1;} {flname="part"i".file"; print > flname; i++}' file.txt
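
If the leading empty line and the extra empty file matter, one possible variation is to let the record separator swallow the whole END line, including any trailing blanks and the newline. This is only a sketch: it assumes an awk that treats a multi-character RS as a regular expression (gawk and mawk do, but POSIX does not require it), and the rs_demo directory and input.txt name are made up for the demo.

```shell
# Scratch directory for the demo (hypothetical name).
mkdir -p rs_demo

# Sample input from the question; some END lines carry a trailing space.
printf '%s\n' 'ATOM 1' 'ATOM 3' 'ATOM 25' 'END ' \
              'ATOM 2' 'ATOM 36' 'ATOM 22' 'ATOM 12' 'END ' \
              'ATOM 1' 'ATOM 87' 'END' > rs_demo/input.txt

awk -v dir=rs_demo '
    # The separator is the whole END line: END, optional blanks, newline.
    BEGIN { RS = "END[ \t]*\n"; ORS = "" }
    # Skip empty records so that no empty file is created.
    length($0) {
        out = dir "/part" (++i) ".file"   # build the name before redirecting
        print > out
        close(out)                        # avoid running out of open files
    }
' rs_demo/input.txt

# Print each piece preceded by its file name.
head rs_demo/part*.file
```

Each record keeps the newline of its last line, so ORS can stay empty and the files still end with a proper newline.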

Upvotes: 2

KamilCuk

Reputation: 142005

Close. Increment the flag at each END, skip the END line itself, and send the other lines to a file named after the flag. In awk:

awk 'BEGIN{flag=0} /END/{flag++; next} {print $0 > (flag ".txt")}' file

In Bash:

flag=0
while IFS= read -r line; do
   # Match END with optional trailing whitespace, as in the sample input
   if [[ "$line" =~ ^END[[:space:]]*$ ]]; then
      flag=$((flag + 1))
   else
      printf "%s\n" "$line" >> "$flag.txt"
   fi
done <inputfile

The same approach works in any other programming language.

Upvotes: 2
