Kuzeko

Reputation: 1705

Perl Multiple search and replace actions in one large text file

Given a set of replacement strings in a file replacements.txt like

s/string1/replacement1/g;
s/string2/replacement2/g;
s/string3/replacement3/g;
s/string4/replacement4/g;
s/string5/replacement5/g;

I would like to obtain the equivalent of

sed -f replacements.txt infile.txt 

My file is so big that sed can't handle it, but I know that perl could do the trick.

Also, there are a lot of replacements, and they change from time to time (I need to run this about a dozen times).

Note that the replacements are fixed strings, so I do not really need those to be regular expressions.

sed has problems only when the regexp has globs and the input file is a single large line.

Upvotes: 3

Views: 1243

Answers (1)

mklement0

Reputation: 440142

The perl equivalent of your sed command is:

perl -p replacements.txt infile.txt

It should work with your sample replacements.txt, given that the s statements are properly ;-terminated (note that sed would recognize the end of the line by itself as the statement terminator).


The real problem, however, is that the entire large file is a single line, so the key to avoiding running out of memory is to:

  • temporarily break that line into many short lines,
  • send these short lines through the pipeline and perform the string replacements on them,
  • and then re-join the modified short lines to form a single line again.

If there's a character in the data that delimits records (units of data) in a way that doesn't interfere with the string replacements, breaking the long line into multiple lines with tr is a viable approach. I'll use } as an example, because Kuzeko states that the data is JSON-like:

If you have GNU sed (Linux; verify with sed --version):

tr '}' '\0' < infile.txt | sed -z -f replacements.txt | tr '\0' '}'

Having tr output NUL-separated "lines" (\0) and sed read them accordingly (-z) is the most robust way to handle the chunking.
Unfortunately, the -z / --null-data option is not POSIX-compliant and the BSD/macOS implementation does not support it.
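
Since the option isn't universal, one way to pick between the two variants is to probe for it directly instead of relying on version strings. This is only a sketch, assuming a POSIX-like shell; the variable names probe and sed_has_z are mine, not part of any tool.

```shell
# Feed one NUL-terminated record through sed -z and check whether the
# substitution comes back. If -z is unsupported, sed fails and prints
# nothing to stdout, so the probe stays empty.
probe=$(printf 'a\0' | sed -z 's/a/b/' 2>/dev/null | tr '\0' '\n') || probe=''

if [ "$probe" = b ]; then
  sed_has_z=yes
else
  sed_has_z=no
fi
echo "sed -z supported: $sed_has_z"
```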

Otherwise (e.g., on macOS):

tr '}' '\n' < infile.txt | perl -p replacements.txt | tr '\n' '}'

Caveat: If the single line in infile.txt has a trailing \n, you'll end up with an extra } char. at the end; to prevent that, add an initial tr stage to the pipeline that deletes the \n:
tr -d '\n' < infile.txt | tr '}' '\n' | ...

perl is still needed, because - unlike BSD/macOS sed - it preserves the trailing-\n-or-not status of the input's last line.

Upvotes: 4
