Reputation: 991
I have a file, x, with section delimiters:
The first section
#!
The second section
#!
The third section
And I want to split it into a sequence of separate files, each new file starting at a delimiter, like:
The first section
#!
The second section
#!
The third section
I thought csplit would be the solution, with a command line something like:
$ csplit -sk x '/#!/' {9999}
But the second file (xx01) ends up containing both delimiters:
#!
The second section
#!
Any ideas for how to accomplish what I want in a POSIX compliant way? (Yes, I could reach for Perl/Python/Ruby and friends; but, the point is to stretch my shell knowledge.)
I worry that I've found a bug in OSX csplit. Can people give the following a go and let me know the results?
#!/bin/sh
test -e
# $RANDOM is a bash/ksh extension; a plain POSIX sh such as dash expands it to nothing.
work="$(basename "$0").$RANDOM"
mkdir "$work"
csplit -sk -f "$work/" - '/#/' '{9999}' <<EOF
First
#
Second
#
Third
EOF
if [ $(grep -c '#' $work/01) -eq 2 ]; then
echo FAIL Repeat
else
echo PASS Repeat
fi
rm $work/*
csplit -sk -f "$work/" - '/#/' '/#/' <<EOF
First
#
Second
#
Third
EOF
if [ $(grep -c '#' $work/01) -eq 2 ]; then
echo FAIL Exact
else
echo PASS Exact
fi
uname -a
When I run it on my OS X box (Darwin 11.2.0), I get:
$ ./csplit-test
csplit: #: no match
FAIL Repeat
PASS Exact
Darwin lani.bigpond 11.2.0 Darwin Kernel Version 11.2.0: Tue Aug 9 20:54:00 PDT 2011; root:xnu-1699.24.8~1/RELEASE_X86_64 x86_64
And on my Debian box, I get:
$ sh ./csplit-test
csplit: `/#/': match not found on repetition 2
PASS Repeat
PASS Exact
Upvotes: 3
Views: 2615
Reputation: 36262
Using awk, tested on a Linux machine.
My version of awk:
$ awk --version | head -1
GNU Awk 4.0.0
Content of infile:
$ cat infile
The first section
#!
The second section
#!
The third section
Content of the awk script:
$ cat script.awk
BEGIN {
    ## Set 'Input Record Separator' variable. A multi-character RS works
    ## in gawk, but POSIX leaves its behavior unspecified.
    RS = "#!";
}
{
    ## Set an integer variable as output file name.
    ++filenum;
}
## For first section.
FNR == 1 {
    ## Remove leading and trailing whitespace.
    sub( /^[[:space:]]+/, "", $0);
    sub( /[[:space:]]+$/, "", $0);
    ## Print to output file.
    printf "%s\n", $0 > (filenum ".txt")
}
## For sections from second one to last one.
FNR > 1 {
    ## Remove trailing whitespace.
    sub( /[[:space:]]+$/, "", $0);
    ## Print to output file.
    printf "%s%s\n", RS, $0 > (filenum ".txt")
}
Running the script:
$ awk -f script.awk infile
Check output:
$ ls [0-9].txt
1.txt 2.txt 3.txt
$ cat 1.txt
The first section
$ cat 2.txt
#!
The second section
$ cat 3.txt
#!
The third section
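The script above relies on a multi-character RS, which gawk supports but POSIX leaves unspecified. A line-based variant that should run under any POSIX awk might look like the sketch below. The function name split_sections and the output prefix are made up for illustration, and it assumes every delimiter starts its own line.

```shell
# Sketch only: split a file at each line beginning with "#!", keeping the
# delimiter as the first line of the new piece.
# $1 = input file, $2 = output prefix (both hypothetical names).
split_sections() {
    awk '
        # A delimiter after the first line closes the current piece
        # and moves on to the next numbered file.
        /^#!/ && NR > 1 { close(out); n++ }
        # Every line, delimiter included, goes to the current piece.
        { out = sprintf("%s%02d", prefix, n); print > out }
    ' prefix="$2" "$1"
}
```

Calling split_sections infile part on the sample input should leave part00, part01 and part02 in the current directory. The close() call keeps only one output file open at a time, which matters for inputs with many sections.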
Upvotes: 1
Reputation: 77105
Though not ideal, you can do something like this with awk.
Your file:
[jaypal:~/Temp] cat f0
The first section
#!
The second section
#!
The third section
Get everything before the first #! using this (you can redirect the output to a file):
[jaypal:~/Temp] awk '/#!/{exit;}1' f0
The first section
Get each #! followed by its content, splitting before the next #!:
[jaypal:~/Temp] awk '/^#!/{x++}{print >(x".txt")}' f0
[jaypal:~/Temp] ls *.txt
1.txt 2.txt
[jaypal:~/Temp] cat 1.txt
#!
The second section
[jaypal:~/Temp] cat 2.txt
#!
The third section
An easier way might be perl, using something like this:
#!/usr/bin/perl
undef $/;
$_ = <>;
$n = 0;
for $match (split(/(?=#!)/)) {
    open(O, '>temp' . ++$n);
    print O $match;
    close(O);
}
Files created by script:
[jaypal:~/Temp] cat temp1
The first section
[jaypal:~/Temp] cat temp2
#!
The second section
[jaypal:~/Temp] cat temp3
#!
The third section
Upvotes: 1
Reputation: 11
Uh oh. (FreeBSD 8.1 install running in a Parallels VM)
src ./test_split.sh
csplit: #: no match
FAIL Repeat
PASS Exact
FreeBSD <hostname> 8.1-RELEASE FreeBSD 8.1-RELEASE #0: Mon Jul 19 02:55:53 UTC 2010 [email protected]:/usr/obj/usr/src/sys/GENERIC i386
Upvotes: 1
Reputation: 33732
This seems to work for me on Linux:
csplit -sk filename '/#!/' {*}
giving:
$ more xx00
The first section
$ more xx01
#!
The second section
$ more xx02
#!
The third section
You could also use Ruby or Perl to do this in a tiny script, and get rid of the delimiters altogether.
Running the test script on Fedora 13 Linux:
$ ./test.sh
csplit: `/#/': match not found on repetition 2
PASS Repeat
PASS Exact
Linux localhost.localdomain 2.6.34.8-68.fc13.x86_64 #1 SMP Thu Feb 17 15:03:58 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
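Dropping the delimiters altogether does not strictly need Ruby or Perl. As a rough sketch, assuming each delimiter is a line consisting of exactly #!, plain awk can do it too. The function name split_without_delims and the output prefix are invented for this example.

```shell
# Sketch only: split a file at each "#!" line and discard the delimiter lines.
# $1 = input file, $2 = output prefix (both hypothetical names).
split_without_delims() {
    awk '
        # Delimiter line: finish the current piece, advance the counter,
        # and skip printing the delimiter itself.
        /^#!$/ { if (out != "") close(out); n++; next }
        # Any other line is written to the current numbered piece.
        { out = sprintf("%s%02d", prefix, n); print > out }
    ' prefix="$2" "$1"
}
```

For example, split_without_delims filename sec should write sec00, sec01 and sec02 for the sample input, none of them containing a #! line.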
Upvotes: 2