Reputation: 325

Character offset with grep

When I run this in my bash terminal:

grep -ob "amilase" <<< "α-amilase"

I get this:

3:amilase

Let's define byte offset as the number of bytes before the matching word and character offset as the number of user-visible characters before the matching word.

Above, the 3 is the byte offset of the matching word. α is a Unicode character that occupies 2 bytes in UTF-8 and the hyphen occupies 1 byte, which is why we get 3.
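To make the two counts concrete, here is a minimal sketch that measures the text preceding the match both ways. It assumes the shell runs in a UTF-8 locale; in the C locale both numbers would come out the same. The variable names are only illustrative.

```bash
s='α-'                               # the text that precedes the match
bytes=$(printf '%s' "$s" | wc -c)    # wc -c always counts bytes
chars=${#s}                          # ${#s} counts characters in a UTF-8 locale, bytes in the C locale
echo "bytes=$((bytes)) chars=$chars"
```

The byte count is 3 regardless of locale; the character count is the number this question is after.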

But how can I get the character offset, which in this case would be 2? Why 2? Because if you look at the screen and count the visible symbols before the matching word, you count 2.

I'm looking for a solution that behaves like grep -o, that is, if there is more than one match per line, all of them are reported.

Upvotes: 4

Answers (2)

Fred

Reputation: 6995

You can whip up your own poor-man's grep!

Put this in a script named, say, mygrep:

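#!/bin/bash

# Takes an extended regex as its only argument
# Text to match is read from standard input
if [[ $# != 1 ]]
then
  echo "This script takes one argument as input" >&2
  exit 1
fi

while IFS= read -r LINE
do
  while true
  do
    # The greedy first group makes group 2 the LAST match on the line
    [[ "$LINE" =~ ^(.*)($1)(.*)$ ]] || break
    # Stop on an empty match, which would otherwise loop forever
    [[ -n "${BASH_REMATCH[2]}" ]] || break
    # ${#...} counts characters (not bytes) in a UTF-8 locale;
    # print the matched text, as grep -o does, rather than the regex
    echo "${#BASH_REMATCH[1]}:${BASH_REMATCH[2]}"
    # Keep only the text before this match and look for earlier matches
    LINE="${BASH_REMATCH[1]}"
  done | tac
done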

Then simply replace grep in your command:

mygrep "amilase" <<< "α-amilase"

The script loops over all lines of input and matches each one against the regex received as an argument (it can be a simple string, but you have the full power of extended regular expressions at your fingertips if needed). I have updated my answer to allow for multiple matches on a single line. The | tac reverses the order of the output lines, because the greedy first group matches the last occurrence on each line first; if you do not mind the matches appearing in reverse, just remove | tac.

I am not sure the output is exactly what you want, but you can customize it easily.

Please note that the =~ operator does regular-expression matching (POSIX extended syntax, as in egrep), and BASH_REMATCH is an array that holds the whole match at index 0 and the parenthesized sub-expressions starting at index 1.
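As a minimal sketch of how the greedy first group and BASH_REMATCH interact, the function below reports only the last match in a string. The name last_match is hypothetical and not part of the script above; the sketch assumes bash and, for character rather than byte offsets, a UTF-8 locale.

```bash
#!/bin/bash

# Hypothetical helper: print "offset:matched-text" for the LAST match of an
# extended regex ($1) in a string ($2); return 1 if there is no match.
last_match() {
  local re=$1 text=$2
  # Group 1 is greedy, so it takes as much text as it can and group 2 ends
  # up holding the last occurrence of the pattern.
  [[ $text =~ ^(.*)($re)(.*)$ ]] || return 1
  # BASH_REMATCH[0] is the whole match; [1], [2], [3] are the three groups.
  printf '%s:%s\n' "${#BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
}

last_match 'amilase' 'α-amilase'
```

In a UTF-8 locale the call should print 2:amilase. One limitation worth knowing: because the first group is greedy, a variable-length pattern such as b+ is left with only the final b of a run, so this technique is most reliable for fixed strings.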

Upvotes: 4

Kent

Reputation: 195039

If you want the 0-based index of the match, you can try this with awk:

awk '{print index($0,"amil")-1}' <<< "α-amilase"
2

awk '{print index($0,"amil")-1}' <<< "fooamilα-whatever"
3

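If no match is found, it prints -1. Note that index() searches for a fixed string, reports only the first occurrence on each line, and counts characters rather than bytes only when awk is multibyte-aware (for example, gawk in a UTF-8 locale).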
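To report every match on a line, as grep -o does, one possible extension is to loop with awk's match() function, which sets RSTART and RLENGTH. This is only a sketch: all_offsets is a hypothetical name, the pattern is treated as an awk regex rather than a fixed string, and offsets are character-based only with a multibyte-aware awk such as gawk in a UTF-8 locale.

```bash
# Hypothetical helper: print "offset:match" for every match of the awk regex
# given as $1, reading lines from standard input (0-based offsets).
all_offsets() {
  awk -v re="$1" '{
    rest = $0   # part of the line not yet searched
    base = 0    # number of characters already consumed
    # RLENGTH > 0 guards against looping forever on an empty match
    while (match(rest, re) && RLENGTH > 0) {
      printf "%d:%s\n", base + RSTART - 1, substr(rest, RSTART, RLENGTH)
      base += RSTART + RLENGTH - 1
      rest = substr(rest, RSTART + RLENGTH)
    }
  }'
}

printf '%s\n' 'α-amilase, β-amilase' | all_offsets 'amil'
```

With gawk in a UTF-8 locale this should print 2:amil and 13:amil. Because each iteration searches only the remainder of the line, an anchor such as ^ in the pattern would not behave as it does in grep.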

Upvotes: 1
