aish1249

Reputation: 23

Extract large list of lines from large text file

I need to extract ~5000 lines from a file with ~300,000 lines on bash (OSX). Running

sed '128082p;128083p;...(4996 numbers)....;159845q;d' file > output

gives the error

sed: 1: "128082p;128083p;128084p ...": command expected

This same command works if I try to extract 10 lines only. Whereas running

for i in `cat line_file`; do sed -n "$ip" file; done >> output

creates a file that's more than ~5000 lines long. What's the right command in either case?

Edit: this is not a range of numbers.

Upvotes: 2

Views: 356

Answers (1)

mklement0

Reputation: 437833

Tip of the hat to Jonathan Leffler for his help.

It looks like BSD sed as used on macOS (as of macOS 10.12.1) has a hard limit on the size of each line of a script that can be passed to it: 2048 bytes.

When passed as a command-line argument (implicitly as the first operand, or explicitly via -e options), scripts are typically passed as a single line, as you did.

If that single line gets too long, it is regrettably blindly cut off, typically resulting in a seemingly random syntax error, like the one you saw.

There are two workarounds:

  • Make sure that your script contains only short-enough lines, either by separating commands with \n (newlines) instead of ; or by splitting your script across multiple -e options (which is cumbersome), or both.

  • Provide the entire script via a file, using the -f option, in which case all commands must be separated with \n rather than ; anyway.
    In the unlikely event that your script is too long to fit on a single command line (a limit imposed by the system - see bottom), using -f is your only option.
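
As a minimal sketch of the -e variant of the first workaround: each -e option contributes its own script line, so no single line needs to grow long. The temporary directory and sample.txt are hypothetical stand-ins, not files from the question.

```shell
# Hypothetical 5-line input file in a scratch directory.
tmp=$(mktemp -d)
printf 'line %s\n' 1 2 3 4 5 > "$tmp/sample.txt"

# Two -e options instead of one long ';'-separated script;
# sed treats each -e argument as a separate script line.
result=$(sed -n -e '1p;3p' -e '5p' "$tmp/sample.txt")
printf '%s\n' "$result"
```

With thousands of line numbers you would need many -e options, which is why the \n-separated form is usually the more practical choice.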


Here's an example of a command-line script that is too long:

$ sed -n "$(printf '%sp;' {1..432})" <<<'line 1'
sed: 1: "1p;2p;3p;4p;5p;6p;7p;8p ...": command expected # !! ERROR

Even though the script is syntactically correct, cutting its one and only line off at 2048 bytes leaves it incorrect, resulting in the seemingly random command expected error.
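
To see why 432 commands is enough to trigger the error, you can measure the script yourself. This is only a sketch: it uses seq instead of brace expansion so that it also runs in shells other than bash, and the variable names are made up for illustration.

```shell
# Build the same one-line script (1p;2p;...;432p;) and count its characters.
script=$(seq 1 432 | sed 's/$/p;/' | tr -d '\n')
echo "single-line script: ${#script} bytes"

# For comparison: the longest line when each command sits on its own line.
longest=$(seq 1 432 | sed 's/$/p/' |
  awk '{ if (length($0) > max) max = length($0) } END { print max }')
echo "longest line when newline-separated: $longest bytes"
```

The single-line form comes to 2052 bytes, just over the 2048-byte limit, whereas no line of the newline-separated form exceeds 4 bytes.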

In this case, working around the limitation is simple: by replacing ; with \n, the individual lines become short enough:

$ sed -n "$(printf '%sp\n' {1..432})" <<<'line 1'
line 1 # OK

Since you already have a file of line numbers - line_file - you can use an auxiliary sed command to create your \n-separated script from it:

$ sed -n "$(sed 's/$/p/' line_file)" file > output

Here's how to solve the problem with a script file passed via -f, in which the commands are \n-separated:

$ printf '%sp\n' {1..432} > script.sed # Create script file with \n-separated commands.
$ sed -n -f "script.sed" <<<'line 1' # Pass script file via -f
line 1 # OK
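
Applied to your case, the -f approach might look like the sketch below. The contents of file and line_file here are small hypothetical stand-ins created in a scratch directory, and the sketch assumes line_file holds exactly one line number per line, with no blank lines.

```shell
# Hypothetical stand-ins for the question's 'file' and 'line_file'.
tmp=$(mktemp -d)
seq 1 20 | sed 's/^/line /' > "$tmp/file"
printf '%s\n' 3 7 15 > "$tmp/line_file"

# Turn each line number N into the sed command 'Np', one per line.
sed 's/$/p/' "$tmp/line_file" > "$tmp/script.sed"

# Run the generated script file against the input.
sed -n -f "$tmp/script.sed" "$tmp/file" > "$tmp/output"
cat "$tmp/output"
```

Because the script lives in a file, this variant also sidesteps the overall command-line length limit.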

Note: Using a process substitution (sed -n -f <(printf ...) ...) as an ad-hoc script file inexplicably does not work.

Also note that the overall max. length of a command line for invoking an external utility such as sed on macOS (as of 10.12) is 262144 (256 KB; determined with getconf ARG_MAX), and in practice the limit is lower, because the size of the environment-variable block plays a role.
If you were to hit that limit, however, you'd get a more helpful error message: Argument list too long.
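
If you want a rough idea of how much room you have on your own system, something like the following may help. It is only an estimate: the exact accounting of arguments and environment is system-specific, and the variable names are made up for illustration.

```shell
# Overall limit for arguments plus environment, as reported by the system.
arg_max=$(getconf ARG_MAX)

# Approximate size of the current environment block.
env_bytes=$(env | wc -c | tr -d ' ')

echo "ARG_MAX:        $arg_max bytes"
echo "environment:    about $env_bytes bytes"
echo "rough headroom: about $((arg_max - env_bytes)) bytes"
```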

Upvotes: 3
