Reputation: 35
I have a many-line file in which each line contains a comma. From the text before the comma, I want to remove every character that appears after the comma, and then drop the comma and everything after it. I have a bash script which does this, but it isn't fast enough.
Input:
hello world, def
Output:
hllo worl
My slow script:
#!/bin/bash
while read line; do
values="${line#*, }"
phrase="${line%, *}"
echo "${phrase//[$values]}"
done < "$1"
I want to improve the performance. Any suggestions?
Upvotes: 0
Views: 88
Reputation: 729
An AWK solution (edited taking inspiration from @glenn jackman's perl solution):
awk -F", " '{ gsub("["$2"]",""); print $1 }' "$1"
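A slightly more defensive variant of the one-liner above, as a sketch only. The function name strip_chars is made up for illustration. It assumes the separator is always a comma followed by one space, and that the text after the comma contains no characters that are special inside a bracket expression (such as ], ^, - or a backslash). Lines without a separator are printed unchanged instead of producing an empty "[]" pattern.

```shell
# strip_chars: hypothetical helper name, not part of the original answer.
# Reads the files given as arguments, or standard input if there are none.
strip_chars() {
    awk -F', ' '
        # Only build the bracket expression when there is something after the comma.
        NF > 1 && $2 != "" { gsub("[" $2 "]", "", $1) }
        { print $1 }
    ' "$@"
}

printf 'hello world, def\nno comma here\n' | strip_chars
# prints:
# hllo worl
# no comma here
```

Applying gsub to $1 rather than to the whole record means only the part before the comma is modified.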
With this sort of line processing, it's often better to use a compiled solution. I would use Haskell for its expressiveness:
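-- answer.hs
import Data.Char (isSpace)  -- only needed for the dropWhile isSpace variant

main :: IO ()
main = interact (unlines . map perLine . lines)

-- Split the line at the first comma, then filter the left part.
perLine :: String -> String
perLine = strSetDiff . break (== ',')

-- Drop every character of s that occurs in sub (the text after ", ").
strSetDiff :: (String, String) -> String
strSetDiff (s, ',':' ':sub) = filter (`notElem` sub) s
strSetDiff (s, _) = s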
Compile with the command ghc -O2 answer.hs.
This breaks each line into two lists, s and sub, at the first comma, removes the ", " from sub, and then filters s to remove characters that are elements of sub. If there is no comma, the result is the whole line.
This assumes a space always follows the comma. Otherwise, remove the ' ': from the pattern and replace (`notElem` sub) with (`notElem` dropWhile isSpace sub).
Time taken for an 80000 line file consisting of 10 lines repeated 8000 times:
$ time ./answer <infile >outfile
0.38s user 0.00s system 99% cpu 0.386 total
$ time [glenn jackman's perl]
0.68s user 0.00s system 99% cpu 0.691 total
$ time awk -F", " '{ gsub("["$2"]",""); print $1 }' infile > outfile
0.85s user 0.04s system 99% cpu 0.897 total
$ time ./ElBarajas.sh infile > outfile
2.77s user 0.32s system 99% cpu 3.105 total
Personally, I'm willing to admit defeat - the perl solution seems best to me.
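For anyone who wants to repeat the comparison, here is a rough sketch of how an input of the same shape (10 lines repeated 8000 times) could be generated. The sample lines are invented for illustration and are not the data used for the timings above, so absolute numbers will differ.

```shell
# Build a benchmark input: 10 made-up sample lines repeated 8000 times.
infile=$(mktemp)
rep=0
while [ "$rep" -lt 8000 ]; do
    i=1
    while [ "$i" -le 10 ]; do
        # Hypothetical sample line; substitute lines from your real data.
        printf 'hello world %d, def\n' "$i"
        i=$((i + 1))
    done
    rep=$((rep + 1))
done > "$infile"

# Count the lines in the generated file.
grep -c '' "$infile"
```

Each candidate can then be run against "$infile" under time, with its output redirected to a file or to /dev/null, as in the transcript above.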
Upvotes: 0
Reputation: 246799
Using Perl
$ perl -F',' -lane '$F[0] =~ s/[$F[1]]//g; print $F[0]' file
hlloworl
If you don't want to count the space after the comma:
$ perl -F',\s*' -lane '$F[0] =~ s/[$F[1]]//g; print $F[0]' file
hllo worl
Perl excels at text manipulation like this, so I'd expect this to be pretty quick.
Upvotes: 1
Reputation: 16039
Getting rid of the while loop could give your code a boost. Most programs take a file as input and will do the reading for you.
You can replace your program with the following and report the times:
cut -d"," -f1 < file
You can try with awk, changing the field separator to a comma:
awk 'BEGIN {FS=","}; {print $1}' file
You could also try sed (with the modifications suggested by @Qualia):
sed -r -i "s/,.*//g" file
Beware, though, that the -i flag edits your file in place. If that is not the desired effect, you can just do:
sed -r "s/,.*//g" file
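As a quick check, the three commands can be run on the sample line from the question. This is only a sketch: the sed call drops -r, which this pattern does not need. All three keep the text before the first comma. None of them removes the letters d, e and f from that text, so a further step is still needed if the goal is the "hllo worl" output shown in the question.

```shell
# Sample line taken from the question.
line='hello world, def'

# Each command keeps only the text before the first comma.
printf '%s\n' "$line" | cut -d',' -f1
printf '%s\n' "$line" | awk 'BEGIN {FS=","}; {print $1}'
printf '%s\n' "$line" | sed 's/,.*//'
# each prints: hello world
```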
Upvotes: 0