whafrog

Reputation: 25

bash regular expression test: if vs grep

I need to scan each line of a file looking for any characters above hex \x7E. The file has several million rows, so improving efficiency would be great. So far, reading each line in a while loop, this works and finds lines with invalid characters:

echo "$line" | grep -P "[\x7F-\xFF]" > /dev/null 2>&1
if [ $? -eq 0 ]; then...

But this doesn't:

if [[ "$line" =~ [\x7F-\xFF] ]]; then...

I'm assuming it would be more efficient the second way, if I could get it to work. What am I missing?

Upvotes: 1

Views: 549

Answers (3)

glenn jackman

Reputation: 246799

If you're already using Perl regular expressions, you might as well use perl for the task:

perl -ne '
    if (/[\x7F-\xFF]/) {print STDERR $_} else {print}
' file > valid 2> invalid

I'd bet that's faster than a bash loop.

I suspect this would be more efficient, even though it processes the file twice:

grep  -P "[\x7F-\xFF]" file > invalid
grep -vP "[\x7F-\xFF]" file > valid

You'd want to write your grep code as

if grep -qP "[\x7F-\xFF]" <<< "$line"; then...

Upvotes: 0

that other guy

Reputation: 123460

If you're interested in efficiency, you shouldn't write your loop in bash. You should rethink your program in terms of pipes and use efficient tools.

That said, you can do this with

LC_CTYPE=C LC_COLLATE=C
if [[ "$line" =~ [$'\x7f'-$'\xff'] ]]
then 
    echo "It contains bytes \x7F or up"
fi
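For context, a minimal sketch wiring this test into a read loop. The file names input and flagged are assumptions for the demo; \200 is the octal escape for byte \x80:

```shell
# Force byte-wise semantics so the range covers raw bytes, not collation order.
# Bash applies LC_* assignments immediately, no export needed.
LC_CTYPE=C LC_COLLATE=C

# Build a two-line demo file.
printf 'plain ascii line\n' >  input
printf 'bad \200 byte\n'    >> input

# Collect lines containing any byte in \x7F-\xFF.
: > flagged
while IFS= read -r line; do
    if [[ $line =~ [$'\x7f'-$'\xff'] ]]; then
        printf '%s\n' "$line" >> flagged
    fi
done < input
```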

Upvotes: 4

jthill

Reputation: 60275

Quoting the OP's comment: "I basically have to split the file. Valid records go to one file, invalid records go to another."

sed -n '/[^\x0-\x7e]/w badrecords
        //!          w goodrecords'
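Usage sketch: the script reads stdin or a named file. The file names records, goodrecords, and badrecords, and the LC_ALL=C prefix (added here so the byte range is locale-independent), are assumptions for this demo; \200 is the octal escape for byte \x80:

```shell
# Demo input: one clean line, one line with a byte outside \x00-\x7E.
printf 'clean record\n'    >  records
printf 'bad \200 record\n' >> records

# Lines matching the range go to badrecords, the rest (//! negates the
# last regex) go to goodrecords, in a single pass.
LC_ALL=C sed -n '/[^\x0-\x7e]/w badrecords
                 //!          w goodrecords' records
```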

Upvotes: 1
