I am trying to use csplit in Bash to split a file, using years in the 1500s and 1600s as delimiters.
When I run the command
csplit Shakespeare.txt '/1[56]../' '{36}'
it almost works, except for at least two issues:
1. The output files run from xx00 through xx37. (Also, xx00 is completely blank.) I don't understand how this is possible.
2. One of the files (the reason csplit returns 37 non-empty files instead of the 36 non-empty files I expected) doesn't begin with 15XX or 16XX. It begins with "ACT 4 SCENE 15\n" (where \n denotes a newline). I don't understand how csplit can match a newline as if it were a digit.
When I run the command that I actually want,
csplit Shakespeare.txt '/1[56][0-9][0-9]/' '{36}'
the terminal returns the error
csplit: 1[56][0-9][0-9]: no match
in addition to listing all of the numbers it lists when the first command is executed.
This especially doesn't make sense to me, since grep says otherwise:
grep -c "1[56][0-9][0-9]" Shakespeare.txt
36
grep -c "1[56].." Shakespeare.txt
36
Note: man csplit indicates that I have the BSD version from January 26, 2005. man grep indicates that I have the BSD version from July 28, 2010.
Upvotes: 0
Based on the answer given here by user 'DRL' on 06-20-2008, I decided to try adding the -k option to csplit.
csplit -k Shakespeare.txt '/^1[56][0-9][0-9]/' '{36}'
This returned an error:
csplit: ^1[56][0-9][0-9]: no match
However, it still gave (more or less) the desired output: files xx00.txt through xx36.txt (not xx37.txt), and each of the non-empty files, xx01.txt through xx36.txt, had the expected content. (In particular, no file began with "ACT 4 SCENE 15".)
The man page for csplit says the following about the -k flag:
-k Do not remove output files if an error occurs or a HUP, INT or TERM signal is received.
Honestly, I don't quite understand what this means, but I still have the following conjecture about why this solution works:
Conjecture: csplit expects the beginning of the file to match the regex. Since the first line of the file did not match ^1[56][0-9][0-9], csplit threw a tantrum and quit when run without the -k flag.
Nevertheless, I still don't understand why 1[56][0-9][0-9] did not work; maybe it is for the same reason. And I definitely don't understand why 1[56].. did not work (i.e. why csplit produced a 37th file not beginning with the pattern).
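One thing that may be worth checking, separately from the conjecture above: POSIX describes '{num}' as repeating the previous operand num more times, so a pattern followed by '{36}' would be applied 37 times in total, one more than the 36 lines grep counts. The sketch below tries this on a tiny made-up file; sample.txt, its contents, and the variable names are my own stand-ins, not the real Shakespeare.txt, and I have not verified that this is what happens with the BSD csplit on the real file.

```shell
#!/bin/sh
# Hypothetical sample data standing in for Shakespeare.txt.
work=$(mktemp -d)
cd "$work" || exit 1
printf 'COMPLETE WORKS\n1599\nfirst play\n1600\nsecond play\n1601\nthird play\n' > sample.txt

pattern='^1[56][0-9][0-9]'
n=$(grep -c "$pattern" sample.txt)   # number of lines that start with a year

# One repeat too many: the pattern is applied n + 1 times, so the last search
# finds nothing. Without -k, csplit removes the files it had already written.
if csplit -s sample.txt "/$pattern/" "{$n}" 2>/dev/null; then
    too_many=ok
else
    too_many=failed
fi
left_after_failure=$(ls xx* 2>/dev/null | wc -l | tr -d ' ')

# The pattern once plus n - 1 repeats: n applications in total.
csplit -s sample.txt "/$pattern/" "{$((n - 1))}"
pieces=$(ls xx* | wc -l | tr -d ' ')

echo "matching lines: $n"
echo "run with {$n}: $too_many, files left behind: $left_after_failure"
echo "run with {$((n - 1))}: $pieces files"
```

If the same thing applies to the real file, '{35}' rather than '{36}' would be the repeat count to try, and -k would only be hiding the failed final search by keeping the files that had already been written.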
Upvotes: 0