Reputation: 35

Performance Issue with While and Read

I have a file with many lines, each containing a comma. For each line, I want to remove from the text before the comma every character that appears after the comma, and then drop the comma and everything after it. I have a bash script that does this, but it isn't fast enough.

Input:

hello world, def

Output:

hllo worl

My slow script:

#!/bin/bash

while read line; do
    values="${line#*, }"
    phrase="${line%, *}"
    echo "${phrase//[$values]}"
done < "$1"

I want to improve the performance. Any suggestions?

Upvotes: 0

Answers (3)

Qualia

Reputation: 729

An AWK solution (edited taking inspiration from @glenn jackman's perl solution):

awk -F", " '{ gsub("["$2"]",""); print $1 }' "$1"

With this sort of line processing, it's often better to use a compiled solution. I would use Haskell for its expressiveness:

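-- answer.hs
import Data.Char (isSpace)  -- only needed for the dropWhile isSpace variant

main :: IO ()
main = interact (unlines . map perLine . lines)

-- Split each line at the first comma, then filter the left part.
perLine :: String -> String
perLine = strSetDiff . break (== ',')

-- Drop from s every character that occurs after the ", " separator.
strSetDiff :: (String, String) -> String
strSetDiff (s, ',':' ':sub) = filter (`notElem` sub) s
strSetDiff (s, _) = s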

Compile with the command ghc -O2 answer.hs.

This breaks each line at the first comma into two lists, s and sub, strips the leading ", " from sub, and then filters s to remove every character that is an element of sub. If there is no comma, the result is the whole line.

This assumes that a space always follows the comma. Otherwise, remove the ' ': from the pattern and replace notElem sub with notElem (dropWhile isSpace sub).
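
For comparison, the same split-and-filter idea can be sketched in awk without building a bracket expression from the input, so characters such as ], ^ or - after the comma should not be interpreted as regex syntax. This is only a sketch: the function name strip_listed_chars is made up for illustration, and it assumes the separator is the first ", " on the line.

```shell
#!/bin/bash
# strip_listed_chars: hypothetical helper name, not part of any answer here.
# For each line of the form "phrase, chars", print phrase without any
# character that occurs in chars. Assumes the first ", " is the separator.
strip_listed_chars() {
    awk '{
        pos = index($0, ", ")            # position of first ", ", 0 if absent
        if (pos == 0) { print; next }    # no separator: print line unchanged
        phrase = substr($0, 1, pos - 1)  # text before the separator
        chars  = substr($0, pos + 2)     # text after the separator
        out = ""
        n = length(phrase)
        for (i = 1; i <= n; i++) {
            c = substr(phrase, i, 1)
            # keep c only if it does not occur in chars (plain string search)
            if (index(chars, c) == 0) out = out c
        }
        print out
    }' "$@"
}
```

It reads the files given as arguments, or standard input if there are none, for example: strip_listed_chars infile > outfile. The per-character loop will likely be slower than a single gsub, so it trades some speed for robustness.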

Time taken for an 80000 line file consisting of 10 lines repeated 8000 times:

$ time ./answer <infile >outfile
0.38s user 0.00s system 99% cpu 0.386 total

$ time [glenn jackman's perl]
0.68s user 0.00s system 99% cpu 0.691 total

$ time awk -F", " '{ gsub("["$2"]",""); print $1 }' infile > outfile
0.85s user 0.04s system 99% cpu 0.897 total

$ time ./ElBarajas.sh infile > outfile
2.77s user 0.32s system 99% cpu 3.105 total

Personally, I'm willing to admit defeat: the Perl solution seems best to me.
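
To rerun the timings, a test file of the same shape (10 lines repeated 8000 times) could be generated along these lines. The helper name make_infile and the ten sample lines are invented for illustration; the lines used for the numbers above are not shown, so your results will differ.

```shell
#!/bin/bash
# make_infile: hypothetical helper that writes a block of ten sample lines
# to stdout a given number of times (default 8000, i.e. 80000 lines).
# The sample lines are made up; they are not the original benchmark data.
make_infile() {
    reps=${1:-8000}
    i=0
    while [ "$i" -lt "$reps" ]; do
        printf '%s\n' \
            'hello world, def' \
            'performance issue, aeiou' \
            'while and read, ae' \
            'bash script, s' \
            'many line file, nm' \
            'remove characters, rc' \
            'after the comma, t' \
            'field separator, fs' \
            'text manipulation, xyz' \
            'compiled solution, o'
        i=$((i + 1))
    done
}
```

For example, make_infile > infile creates the input, after which each command can be timed with time as above.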

Upvotes: 0

glenn jackman

Reputation: 246799

Using Perl

$ perl -F',' -lane '$F[0] =~ s/[$F[1]]//g; print $F[0]' file
hlloworl

If you don't want to count the space after the comma:

$ perl -F',\s*' -lane '$F[0] =~ s/[$F[1]]//g; print $F[0]' file
hllo worl

Perl excels at text manipulation like this, so I'd expect this to be pretty quick.

Upvotes: 1

dinox0r

Reputation: 16039

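Getting rid of the while loop could give your code a boost. Most programs take a file as input and will do the reading for you.

You can replace your program with the following and compare the timings (note that these commands only cut each line at the comma; they do not remove the listed characters from the remaining text):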

cut -d"," -f1 < file

You can also try awk, setting the field separator to a comma:

awk 'BEGIN {FS=","}; {print $1}' file

You could also try sed (with the modifications suggested by @Qualia):

sed -r -i "s/,.*//g" file

Beware, though, that the -i flag edits your file in place. If that is not what you want, drop it:

sed -r "s/,.*//g" file

Upvotes: 0
