user3150037

Reputation: 147

Optimize sed for multiple replacements

I have a file, users.txt, with words like,

user1
user2
user3

I want to find these words in another file, data.txt, and add a prefix to each match. data.txt has nearly 500K lines. For example, user1 should be replaced with New_user1, and so on. I have written a simple shell script like

for user in $(cat users.txt)
do
    sed -i "s/${user}/New_&/" data.txt
done

For ~1000 words, this script takes minutes to finish, which surprised me because sed is very fast when it comes to find and replace. I tried the suggestions in Optimize shell script for multiple sed replacements, but did not see much improvement.

Is there any other way to make this process faster?

Upvotes: 2

Views: 1537

Answers (3)

blackpen

Reputation: 2424

sed is known to be very fast (probably second only to hand-written C).

Instead of sed 's/X/Y/g' input.txt, try sed '/X/ s/X/Y/g' input.txt. The address /X/ pre-filters lines, so the substitution machinery only runs on lines that actually contain a match, which tends to be faster when most lines don't match.
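You can check the difference on your own data with a rough comparison like this (using the question's file names; not a rigorous benchmark):

time sed 's/user1/New_user1/g' data.txt > /dev/null
time sed '/user1/ s/user1/New_user1/g' data.txt > /dev/null

The gap grows with the fraction of lines that contain no match.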

Since sed processes its input one line at a time, you can also parallelize the work across CPU cores with GNU parallel, like this:

cat huge-file.txt | parallel --pipe sed -e '/xxx/ s/xxx/yyy/g'
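Note that --pipe writes the result to standard output rather than editing the file in place, so redirect it to a new file. The chunk size handed to each sed process can be tuned with --block (10M below is just an illustrative value):

cat huge-file.txt | parallel --pipe --block 10M sed -e '/xxx/ s/xxx/yyy/g' > huge-file.out.txt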

If you are working with plain ASCII files, you can speed sed up further by using the "C" locale:

LC_ALL=C sed -i -e '/xxx/ s/xxx/yyy/g' huge-file.txt

Upvotes: 5

User9102d82

Reputation: 1190

Or, in one go, we can do something like this. Let us say we have a data file with 500K lines.

$> wc -l data.txt
500001 data.txt

$> ls -lrtha data.txt
-rw-rw-r--. 1 gaurav gaurav 16M Oct  5 00:25 data.txt

$> head -2 data.txt ; echo ; tail -2 data.txt
0|This is a test file maybe
1|This is a test file maybe

499999|This is a test file maybe
500000|This is a test file maybe

Let us say that our users.txt has 3-4 keywords, each of which is to be prefixed with "ab_" in the file data.txt:

$> cat users.txt
file
maybe
test

So we want to read users.txt and, for every word, change that word to a new word. For example, "file" becomes "ab_file", "maybe" becomes "ab_maybe", and so on.

We can run a while loop that reads the input words one by one and runs a perl command over the file for each word. In the example below, each word read is passed to the perl command as $word.

I timed this task and it finishes fairly quickly. I ran it on a CentOS 7 VM hosted on Windows 10.

$> time cat users.txt | while read word; do perl -pi -e "s/${word}/ab_${word}/g" data.txt; done
real    0m1.973s
user    0m1.846s
sys     0m0.127s

$> head -2 data.txt ; echo ; tail -2 data.txt
0|This is a ab_test ab_file ab_maybe
1|This is a ab_test ab_file ab_maybe

499999|This is a ab_test ab_file ab_maybe
500000|This is a ab_test ab_file ab_maybe

In the above code, we read the words test, file, and maybe and changed them to ab_test, ab_file, and ab_maybe in the data.txt file. The head and tail output confirms the operation.
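Note that this loop still rewrites data.txt once per word, so with many words it will slow down just like the sed loop in the question. A single-pass perl variant along these lines should scale better (a sketch assuming the same users.txt and ab_ prefix; \Q...\E escapes any regex metacharacters in the words):

perl -i -pe 'BEGIN { open my $fh, "<", "users.txt" or die "users.txt: $!"; chomp(@words = <$fh>); } for my $w (@words) { s/\Q$w\E/ab_$w/g }' data.txt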

cheers, Gaurav

Upvotes: 1

Benjamin W.

Reputation: 52556

You can turn your users.txt into sed commands like this:

$ sed 's|.*|s/&/New_&/|' users.txt 
s/user1/New_user1/
s/user2/New_user2/
s/user3/New_user3/

And then use this to process data.txt, either by writing the output of the previous command to an intermediate file, or with process substitution:

sed -f <(sed 's|.*|s/&/New_&/|' users.txt) data.txt
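If you prefer the intermediate file, that variant would look like this (the script name prefix.sed is arbitrary):

sed 's|.*|s/&/New_&/|' users.txt > prefix.sed
sed -f prefix.sed data.txt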

Your approach reads and rewrites all of data.txt once for every single line in users.txt (roughly 1000 full passes over a 500K-line file), which is what makes it slow. The generated sed script makes a single pass instead.

If you can't use process substitution, you can use

sed 's|.*|s/&/New_&/|' users.txt | sed -f - data.txt

instead.

Upvotes: 3
