matheus-fonseca

Reputation: 13

Substitute first string match in 1-line 2GB file on Linux

I'm trying to substitute only the first match of one string in a huge file that has only one line (2.1 GB); the substitution will run as part of a shell script job. The big problem is that the machine running this script has only 1 GB of memory (approximately 300 MB free), so I need a buffered strategy that doesn't overflow it. I have already tried sed, perl, and a python approach, but all of them give me out-of-memory errors. Here are my attempts (found in other questions):

# With perl
perl -pi -e '!$x && s/FROM_STRING/TO_STRING/ && ($x=1)' file.txt

# With sed
sed '0,/FROM_STRING/s//TO_STRING/' file.txt > file.txt.bak

# With python (in a custom script.py file)
import fileinput

for line in fileinput.input('file.txt', inplace=True):
    print line.replace('FROM_STRING', 'TO_STRING', 1)
    break

One good point is that the FROM_STRING I'm searching for always appears near the beginning of this huge 1-line file, within the first 100 characters. Another good thing is that execution time is not a problem; it can take as long as it needs.

EDIT (SOLUTION):

I tested three of the solutions from the answers and all of them solved the problem, thanks to all of you. I measured the performance with Linux time and they all take roughly the same time as well, about 10 seconds... But I chose @Miller's solution because it's the simplest (it just uses perl).

Upvotes: 1

Views: 390

Answers (5)

ikegami

Reputation: 385655

  • Given that the string to replace is in the first 100 bytes,
  • Given that Perl IO is slow unless you start using sysread to read large blocks,
  • Assuming that the substitution changes the size of the file[1], and
  • Assuming that binmode isn't needed[2],

I'd use

( head -c 100 | perl -0777pe's/.../.../' && cat ) <file.old >file.new

  1. A faster solution for that exists.
  2. Though it's easy to add if needed.
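
For example, with the question's placeholder strings filled in (purely illustrative, using the answer's file.old/file.new naming):

( head -c 100 | perl -0777pe's/FROM_STRING/TO_STRING/' && cat ) <file.old >file.new

Only the 100-byte head is ever slurped by perl; cat streams the remaining ~2 GB block by block.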

Upvotes: 1

sirosen

Reputation: 1826

Since you know that your string is always in the first chunk of the file, you should use dd for this. You'll also need a temporary file to work with, as in tmpfile="$(mktemp)"

First, copy the first block of the file to a new, temporary location: dd bs=32k count=1 if=file.txt of="$tmpfile"

Then, do your substitution on that block: sed -i 's/FROM_STRING/TO_STRING/' "$tmpfile"

Next, concatenate the new first block with the rest of the old file, again using dd: dd bs=32k if=file.txt of="$tmpfile" seek=1 skip=1


EDIT: As per Mark Setchell's suggestion, I have added bs=32k to these commands to speed up the dd operations. This is tunable to your needs, but if you tune the commands separately, be careful about the change in semantics between different input and output block sizes.
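
Putting the three steps together, a minimal sketch (my own assembly of the answer's commands; it assumes FROM_STRING and TO_STRING have the same length, so the 32k block boundary still lines up, and that replacing the original file at the end is acceptable):

#!/bin/bash
set -e
tmpfile="$(mktemp)"
dd bs=32k count=1 if=file.txt of="$tmpfile"         # 1. copy the first 32k block
sed -i 's/FROM_STRING/TO_STRING/' "$tmpfile"        # 2. substitute inside that block only
dd bs=32k if=file.txt of="$tmpfile" seek=1 skip=1   # 3. append the rest of the original file
mv "$tmpfile" file.txt                              # replace the original (assumption, not in the answer)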

Upvotes: 6

leonbloy

Reputation: 75896

A practical (not very compact but efficient) approach would be to split the file, do the search-and-replace, and rejoin the pieces. E.g.:

head -c 100 myfile | sed 's/FROM/TO/' > output.1
tail -c +101 myfile > output.2
cat output.1 output.2 > output && /bin/rm output.1 output.2

Or, in one line:

( ( head -c 100 myfile | sed 's/FROM/TO/' ) && (tail -c +101 myfile ) ) > output

Upvotes: 0

Miller

Reputation: 35198

If you're certain the string you're trying to replace is just in the first 100 characters, then the following perl one-liner should work:

perl -i -pe 'BEGIN {$/ = \1024} s/FROM_STRING/TO_STRING/ .. undef' file.txt

Explanation:

Switches:

  • -i: Edit <> files in place (makes backup if extension supplied)
  • -p: Creates a while(<>){...; print} loop for each “line” in your input file.
  • -e: Tells perl to execute the code on command line.

Code:

  • BEGIN {$/ = \1024}: Set the $INPUT_RECORD_SEPARATOR ($/) to a reference to 1024, so each “line” is read as a fixed 1024-byte record rather than the whole file.
  • s/FROM/TO/ .. undef: Use a flip-flop so the substitution stops being attempted after its first successful match. Could also have used if $. == 1 to limit it to the first record.
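
A quick way to sanity-check the one-liner on a small throwaway file (the test data below is purely illustrative, not from the answer):

printf 'aaaFROM_STRINGbbb' > test.txt
head -c 2000000 /dev/zero | tr '\000' 'c' >> test.txt   # pad to a single ~2 MB line
perl -i -pe 'BEGIN {$/ = \1024} s/FROM_STRING/TO_STRING/ .. undef' test.txt
head -c 30 test.txt   # expect: aaaTO_STRINGbbbccc...

Because $/ is a reference to 1024, only one 1024-byte record is held in memory at a time.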

Upvotes: 2

ysth

Reputation: 98388

Untested, but I would do:

perl -pi -we 'BEGIN{$/=\65536} s/FROM_STRING/TO_STRING/ if 1..1' file.txt

to read in 64k chunks.

Upvotes: 0
