Reputation: 151
I have a 5000-line file consisting of blocks of lines, with an END string between blocks, as follows:
ATOM 1
ATOM 3
ATOM 25
END
ATOM 2
ATOM 36
ATOM 22
ATOM 12
END
ATOM 1
ATOM 87
END
I want to find a way to split the file into several files, each containing a single block of lines before the END string. The first file should look as follows:
ATOM 1
ATOM 3
ATOM 25
The second file should contain
ATOM 2
ATOM 36
ATOM 22
ATOM 12
And so on. I have thought of using something like
awk '/END/{flag=1; next} /END/{flag=0} flag' file
to take the blocks between the END strings. This, however, does not work for my first block, because the END string comes only after the block. More importantly, it cannot count how many times it has found the string END, so it cannot send each block to its own file.
Is there a way I can use the string END to split my file into several, each containing a block that ends with the string END?
Upvotes: 1
Views: 3100
Reputation: 104102
A few other ways.
Perl:
perl -0777 -lnE 'while (/([\s\S]*?)^END\s*/gm) {
$cnt++;
open(FH, ">file_${cnt}.txt");
print FH $1;
close (FH);
}' file
Ruby:
ruby -e 'cnt=1; s=$<.read.scan(/([\s\S]*?)^END\s*/m) { |b|
File.write("file_#{cnt}.txt", b.join(""))
cnt+=1
}' file
Any awk:
awk 'BEGIN { i=1; fn=sprintf("file_%s.txt", i) }
$1=="END" { close(fn); fn=sprintf("file_%s.txt", ++i); next }
{print > fn }
' file
Or, you can use sed and process substitution with Bash (note: this only works if the file is properly terminated with a final newline):
while IFS= read -r -d $'\3' block; do
(( i++ ))
printf "%s" "$block" > "file_${i}.txt"
done < <(sed '/^END[[:space:]]*$/N; s/^END[[:space:]]*/\x3/' file)
Any of these results in:
head file_*.txt
==> file_1.txt <==
ATOM 1
ATOM 3
ATOM 25
==> file_2.txt <==
ATOM 2
ATOM 36
ATOM 22
ATOM 12
==> file_3.txt <==
ATOM 1
ATOM 87
# ^ Note final file has proper \n termination
Upvotes: 0
Reputation: 58578
This might work for you (GNU csplit):
csplit -qz -f file -b '%04d.txt' --suppress-matched file '/END/' '{*}'
Be quiet and elide any empty files.
Prefix the output files with file and suffix with four digits plus .txt.
Suppress the matching lines, e.g. END.
Repeat until the end of the file.
If you do not mind the files defaulting to xxn, use:
csplit -qz --sup file '/END/' '{*}'
Upvotes: 1
Reputation: 67567
$ awk '/END/{c++; next} {print > ("file."(c+1)".txt")}' file
==> file.1.txt <==
ATOM 1
ATOM 3
ATOM 25
==> file.2.txt <==
ATOM 2
ATOM 36
ATOM 22
ATOM 12
==> file.3.txt <==
ATOM 1
ATOM 87
If you have many sections, you may eventually run into a "too many open files" error, so it is better to close each file when you are done with it.
$ awk 'BEGIN {f="file."(++c)".txt"}
/END/ {close(f); f="file."(++c)".txt"; next}
{print > f}' file
Upvotes: 1
Reputation: 204638
Using any awk:
$ awk -v cnt=1 '
/END/ { cnt++; next }
cnt != prev { close(out); out="foo" cnt ".txt"; prev=cnt }
{ print > out }
' file
$ head foo*.txt
==> foo1.txt <==
ATOM 1
ATOM 3
ATOM 25
==> foo2.txt <==
ATOM 2
ATOM 36
ATOM 22
ATOM 12
==> foo3.txt <==
ATOM 1
ATOM 87
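One caveat for the /END/ pattern: it matches any line that contains END anywhere, not only a line that is exactly END. A minimal sketch of a stricter variant, assuming the delimiter is always a line consisting solely of END (the sample data, sample.txt and the blk prefix are hypothetical names chosen for illustration):

```shell
# Hypothetical sample: the second data line contains "END" inside another word.
printf '%s\n' 'ATOM 1' 'ATOM BEND 3' 'END' 'ATOM 2' 'END' > sample.txt

# Same structure as the command above, but a line only counts as a
# delimiter when the whole line equals "END".
awk -v cnt=1 '
    $0 == "END" { cnt++; next }                                   # delimiter line: move to the next block
    cnt != prev { close(out); out = "blk" cnt ".txt"; prev = cnt } # first line of a new block: switch files
    { print > out }
' sample.txt
```

With this comparison the line "ATOM BEND 3" stays in the first block, whereas /END/ would treat it as a delimiter and drop it.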
Upvotes: 3
Reputation: 2687
awk
's record
separator (RS
) can be reset to read blocks separated by the word "END", and each block can be printed to a file with a numerically incremented filename as follows:
awk 'BEGIN{RS="END";ORS="";i=1;} {print > "part"i".file"; i++}' file.txt
The output record separator ORS has been set to an empty string to prevent additional newlines at the end of each file. Files after the first part still have a leading empty line, which could be removed if essential. It also creates an additional empty file that can be ignored for this 'quick and dirty' solution.
An incremented counter i is used to form sequential file names.
Output from the above procedure, run on a copy of your input:
> ls part*
part1.file part2.file part3.file part4.file
> cat part1.file
ATOM 1
ATOM 3
ATOM 25
>cat part2.file
ATOM 2
ATOM 36
ATOM 22
ATOM 12
(part4.file is empty)
Possible problem: how an unparenthesized concatenation after a print redirection is parsed differs between awk versions, and some reject it or build the wrong filename. If an error occurs here, the filename can be built first, as in this slightly longer version:
awk 'BEGIN{RS="END";ORS="";i=1;} {flname="part"i".file"; print > flname; i++}' file.txt
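If the leading empty line and the extra empty file matter, one possible refinement is to make the newline that follows END part of the separator and to skip records that contain no fields. This is only a sketch: it assumes an awk that accepts a multi-character RS (gawk and mawk do; POSIX only guarantees a single character), and blocks.txt is a hypothetical sample file created for the demonstration.

```shell
# Hypothetical sample input in the same layout as the question.
printf '%s\n' 'ATOM 1' 'ATOM 3' 'ATOM 25' 'END' 'ATOM 2' 'ATOM 36' 'END' > blocks.txt

# RS includes the newline after END, so the next record does not start
# with an empty line. The NF condition skips records without any fields,
# so no empty trailing file is written.
awk 'BEGIN { RS = "END\n"; ORS = "" }
     NF    { out = "part" (++i) ".file"; print > out; close(out) }' blocks.txt
```

Building the filename in a variable first also sidesteps the redirection-parsing issue mentioned above, and closing each file right after writing keeps the number of open files at one.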
Upvotes: 2
Reputation: 142005
You are close. Increment the flag at each END, skip the END line itself, and write every other line to a file named after the flag. In awk:
awk 'BEGIN{flag=0} /END/{flag++; next} {print $0 > (flag ".txt")}' file
In Bash:
flag=0
while IFS= read -r line; do
if [[ "$line" = "END" ]]; then
flag=$((flag + 1))
else
printf "%s\n" "$line" >> "$flag.txt"
fi
done <inputfile
The same approach works in any other programming language.
Upvotes: 2