Chill2Macht

Reputation: 1221

Why don't `csplit` and `grep` agree on whether there are matches?

I am trying to use csplit in Bash to split a file, using years in the 1500s and 1600s as delimiters.

When I do the command

csplit Shakespeare.txt '/1[56]../' '{36}'

it almost works, except for at least two issues:

  1. This outputs 38 files, not 36, numbered xx00 through xx37. (Also, xx00 is completely empty.) I don't understand how this is possible.
  2. One of the files doesn't begin with 15XX or 16XX; it begins with "ACT 4 SCENE 15\n" (where \n denotes a newline). This seems to be why csplit returns 37 non-empty files instead of the 36 non-empty files I expected. I don't understand how csplit can match a newline as though it were part of a number.
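The file count in the first point can be reproduced on a small scale. The sketch below uses a hypothetical sample.txt (not the real Shakespeare.txt) and assumes the first line of the file matches the pattern, which would be consistent with xx00 coming out empty. The '{N}' operand repeats the preceding pattern N more times, so the pattern is applied N+1 times in total, and the text after the last match gets a file of its own.

```shell
# Work in a scratch directory so the xx* pieces do not clutter anything.
tmp=$(mktemp -d) && cd "$tmp"

# Hypothetical stand-in for Shakespeare.txt: three "year" lines, the first on line 1.
printf '1599\nplay one\n1601\nplay two\n1603\nplay three\n' > sample.txt

# Three lines match, so the pattern can be applied three times:
# once for the operand itself plus the two repetitions requested by '{2}'.
matches=$(grep -c '^1[56][0-9][0-9]' sample.txt)
csplit -s sample.txt '/^1[56][0-9][0-9]/' '{2}'

# Count the pieces: xx00 holds whatever precedes the first match (nothing here),
# and the last piece holds everything from the final match to the end of the file.
pieces=$(ls xx* | wc -l | tr -d ' ')
echo "matching lines: $matches, output files: $pieces"
```

Scaled up, '{36}' asks for 37 applications of the pattern and so yields 38 files when all 37 succeed, which matches the xx00 through xx37 numbering above.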

When I do the command (which is what I want)

csplit Shakespeare.txt '/1[56][0-9][0-9]/' '{36}'

the terminal returns the error `csplit: 1[56][0-9][0-9]: no match`, in addition to listing all of the numbers it lists when the first command is executed.

This especially doesn't make sense to me, since grep says otherwise:

grep -c "1[56][0-9][0-9]" Shakespeare.txt
36

grep -c "1[56].." Shakespeare.txt
36

Note: man csplit indicates that I have the BSD version from January 26, 2005. man grep indicates that I have the BSD version from July 28, 2010.

Upvotes: 0

Views: 696

Answers (1)

Chill2Macht

Reputation: 1221

Based on the answer given here by user 'DRL' on 06-20-2008, I decided to try adding the -k option to csplit.

csplit -k Shakespeare.txt '/^1[56][0-9][0-9]/' '{36}'

This returned an error: csplit: ^1[56][0-9][0-9]: no match

However, it still gave (more or less) the desired output: files xx00.txt through xx36.txt (not xx37.txt), and each of the non-empty files, xx01.txt through xx36.txt, had the expected content. (In particular, no file began with "ACT 4 SCENE 15".)

The man page for csplit says the following about the -k flag:

-k Do not remove output files if an error occurs or a HUP, INT or TERM signal is received.
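What that sentence means in practice can be seen with a deliberately failing run. This is a minimal sketch on a hypothetical sample.txt, not the real file: the repeat count asks for one more match than the file contains, so csplit reports an error either way, and the only difference -k makes is whether the pieces written before the error are left on disk.

```shell
tmp=$(mktemp -d) && cd "$tmp"
printf '1599\nplay one\n1601\nplay two\n1603\nplay three\n' > sample.txt

# '{3}' asks for four applications of the pattern, but only three lines match.
# Without -k the run fails and csplit removes the pieces it had already written.
status_plain=0
csplit -s sample.txt '/^1[56][0-9][0-9]/' '{3}' 2>/dev/null || status_plain=$?
left_plain=$(find . -type f -name 'xx*' | wc -l | tr -d ' ')

# With -k the same error is reported, but the pieces stay on disk.
status_keep=0
csplit -s -k sample.txt '/^1[56][0-9][0-9]/' '{3}' 2>/dev/null || status_keep=$?
left_keep=$(find . -type f -name 'xx*' | wc -l | tr -d ' ')

echo "without -k: exit status $status_plain, $left_plain files left"
echo "with -k:    exit status $status_keep, $left_keep files left"
```

So -k does not make the pattern match; it only stops csplit from cleaning up after a failed match.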

Honestly, I don't quite understand what this means, but I still have the following conjecture about why this solution worked:

Conjecture: csplit expects the beginning of the file to match the regex. Thus, since the beginning line of the file did not match ^1[56][0-9][0-9], it threw a tantrum and quit without the -k flag.

Nevertheless, I still don't understand why 1[56][0-9][0-9] did not work; perhaps it is for the same reason. And I definitely don't understand why 1[56].. did not work (i.e. why csplit produced a 37th file not beginning with the pattern).
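Another possible reading of the "no match" error, offered as a sketch rather than a confirmed diagnosis: since '{36}' repeats the pattern 36 more times, csplit needs 37 matches, while grep counts only 36 matching lines for 1[56][0-9][0-9]. Under that assumption, the repeat count should be one less than the grep count. The example below derives it that way on a hypothetical sample.txt; it also assumes grep and csplit agree on which lines match, which the 1[56].. case above suggests is not guaranteed.

```shell
tmp=$(mktemp -d) && cd "$tmp"
printf '1599\nplay one\n1601\nplay two\n1603\nplay three\n' > sample.txt

# Keep the pattern in one place so grep and csplit are given the same regex.
pattern='^1[56][0-9][0-9]'
count=$(grep -c "$pattern" sample.txt)

# The pattern operand itself consumes the first match,
# so only count-1 further repetitions are available.
split_status=0
csplit -s sample.txt "/$pattern/" "{$((count - 1))}" || split_status=$?

echo "matching lines: $count, csplit exit status: $split_status"
ls xx*
```

For the real file this would correspond to '{35}' rather than '{36}', with no need for -k.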

Upvotes: 0
