Reputation: 147
I have a file, users.txt, with words like:
user1
user2
user3
I want to find these words in another file, data.txt, and add a prefix to them. data.txt has nearly 500K lines. For example, user1 should be replaced with New_user1, and so on. I have written a simple shell script like:
for user in `cat users.txt`
do
sed -i 's/'${user}'/New_&/' data.txt
done
For ~1000 words, this program takes minutes to run, which surprised me because sed is very fast when it comes to find and replace. I tried the suggestions from Optimize shell script for multiple sed replacements, but still did not see much improvement.
Is there any other way to make this process faster?
Upvotes: 2
Views: 1537
Reputation: 2424
sed is known to be very fast (probably second only to hand-written C).
Instead of sed 's/X/Y/g' input.txt, try sed '/X/ s/X/Y/g' input.txt. The latter is known to be faster.
Since you have "one line at a time" semantics, you can run it with parallel (on multi-core CPUs) like this:
cat huge-file.txt | parallel --pipe sed -e '/xxx/ s/xxx/yyy/g'
If you are working with plain ASCII files, you can speed things up further by using the "C" locale:
LC_ALL=C sed -i -e '/xxx/ s/xxx/yyy/g' huge-file.txt
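The two tips combine; here is a minimal, self-contained sketch (the file name and the xxx/yyy patterns are the answer's placeholders, and a tiny stand-in file is created just for illustration; GNU sed's -i is assumed):

```shell
# Create a small stand-in for huge-file.txt (illustrative data only)
printf 'xxx one\nplain line\nxxx two\n' > huge-file.txt

# Address-prefixed substitution under the C locale:
# the /xxx/ address skips lines with no match cheaply,
# and LC_ALL=C avoids multibyte character handling.
LC_ALL=C sed -i -e '/xxx/ s/xxx/yyy/g' huge-file.txt

cat huge-file.txt
# yyy one
# plain line
# yyy two
```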
Upvotes: 5
Reputation: 1190
Or, in one go, we can do something like this. Let us say we have a data file with 500K lines.
$> wc -l data.txt
500001 data.txt
$> ls -lrtha data.txt
-rw-rw-r--. 1 gaurav gaurav 16M Oct 5 00:25 data.txt
$> head -2 data.txt ; echo ; tail -2 data.txt
0|This is a test file maybe
1|This is a test file maybe

499999|This is a test file maybe
500000|This is a test file maybe
Let us say that our users.txt has a few keywords, which are to be prefixed with "ab_" in the file data.txt:
$> cat users.txt
file
maybe
test
So we want to read users.txt and, for every word, change that word to a new word: for example, "file" to "ab_file", "maybe" to "ab_maybe", and so on.
We can run a while loop that reads the input words one by one, and for each word run a perl command over the file with the word stored in a variable. In the example below, each word read is passed to the perl command as $word.
I timed this task and it completes fairly quickly. I ran it on a CentOS 7 VM hosted on my Windows 10 machine.
time cat users.txt |while read word; do perl -pi -e "s/${word}/ab_${word}/g" data.txt; done
real 0m1.973s
user 0m1.846s
sys 0m0.127s
$> head -2 data.txt ; echo ; tail -2 data.txt
0|This is a ab_test ab_file ab_maybe
1|This is a ab_test ab_file ab_maybe

499999|This is a ab_test ab_file ab_maybe
500000|This is a ab_test ab_file ab_maybe
In the above code, we read the words test, file, and maybe, and changed them to ab_test, ab_file, and ab_maybe in the data.txt file. The head and tail output confirms the operation.
cheers, Gaurav
Upvotes: 1
Reputation: 52556
You can turn your users.txt into sed commands like this:
$ sed 's|.*|s/&/New_&/|' users.txt
s/user1/New_user1/
s/user2/New_user2/
s/user3/New_user3/
And then use this to process data.txt, either by writing the output of the previous command to an intermediate file, or with process substitution:
sed -f <(sed 's|.*|s/&/New_&/|' users.txt) data.txt
Your approach goes through all of data.txt once for every single line in users.txt, which makes it slow.
If you can't use process substitution, you can use
sed 's|.*|s/&/New_&/|' users.txt | sed -f - data.txt
instead.
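Putting the whole thing end to end on a tiny example (file contents invented for illustration; the pipe form is used here since it needs no process substitution, and `sed -f -` reading the script from stdin assumes GNU sed):

```shell
# Tiny stand-ins for users.txt and data.txt
printf 'user1\nuser2\n' > users.txt
printf 'hello user1\nhello user2\nno match here\n' > data.txt

# First sed builds one s/word/New_word/ command per user;
# second sed applies them all in a single pass over data.txt
sed 's|.*|s/&/New_&/|' users.txt | sed -f - data.txt
# hello New_user1
# hello New_user2
# no match here
```

Because data.txt is read only once no matter how many words users.txt contains, this scales far better than re-running sed per word.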
Upvotes: 3