Reputation: 325
When I run this on my bash terminal:
grep -ob "amilase" <<< "α-amilase"
I get this:
3:amilase
Let's define byte offset as the number of bytes before the matching word and character offset as the number of user-visible characters before the matching word.
Above, the 3 is the byte offset of the matching word: α is a Unicode character that occupies 2 bytes in UTF-8, and the hyphen adds 1 more, which is why we get 3.
But how can I get the character offset, which in this case would be 2? Why 2? If you look at the screen and count the visible symbols before the matching word, you count 2.
I'm looking for a solution that behaves like grep -o, that is, if there is more than one match per line, all of them are reported.
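To make the two definitions concrete, here is a minimal sketch that measures the text preceding the match both ways. It assumes bash and relies on `${#var}` counting characters in the current locale; the variable names are only illustrative.

```bash
#!/bin/bash
# Text that precedes the match in "α-amilase".
prefix="α-"

# Byte offset: in the C locale, ${#var} counts bytes.
# The assignment is confined to the command-substitution subshell.
byte_offset=$(LC_ALL=C; echo "${#prefix}")

# Character offset: ${#var} counts characters in the current locale,
# so this is smaller than the byte count only when a UTF-8 locale is active.
char_offset=${#prefix}

echo "bytes: $byte_offset, characters: $char_offset"
```

In a UTF-8 locale the two numbers should differ (3 bytes versus 2 characters); in the C locale both are reported as 3.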
Upvotes: 4
Views: 2128
Reputation: 6995
You can whip up your own poor-man's grep!
Put this in a script named, say, mygrep:
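#!/bin/bash
# Takes an extended regex as its only argument.
# Text to match is read from standard input.
if [[ $# -ne 1 ]]
then
    echo "This script takes one argument as input" >&2
    exit 1
fi
while IFS= read -r LINE
do
    while true
    do
        # The greedy first group makes this find the last match on the line first.
        [[ "$LINE" =~ ^(.*)($1)(.*)$ ]] || break
        # ${#...} counts characters, not bytes, in a UTF-8 locale.
        # Print the text that actually matched rather than the regex itself.
        echo "${#BASH_REMATCH[1]}:${BASH_REMATCH[2]}"
        # Keep searching in the text before this match.
        LINE="${BASH_REMATCH[1]}"
    done | tac
done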
Then simply replace grep in your command:
mygrep "amilase" <<< "α-amilase"
The script loops over all lines of input and matches each one against the regex received as an argument (it can be a simple string, but you have the full power of regular expressions at your fingertips if needed). I have updated my answer to allow for multiple matches on a single line. The | tac reverses the order of the output lines, because greedy matching finds the last occurrence on each line first; if you do not mind the matches appearing in reverse, just remove | tac.
I am not sure the output is exactly what you want, but you can customize it easily.
Please note that the =~ operator does regular-expression matching (POSIX extended syntax, as in egrep), and BASH_REMATCH is an array that gives access to the sub-expressions inside parentheses, starting at index 1 (index 0 holds the whole match).
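As a rough illustration of how BASH_REMATCH and the greedy first group interact, here is a small sketch using plain ASCII input. The function name offsets_last_first is hypothetical and not part of the script above, and it assumes the regex never matches the empty string (otherwise the loop would not terminate).

```bash
#!/bin/bash
# Hypothetical helper: print "offset:match" for every match of the
# extended regex $1 in the string $2, last match first.
offsets_last_first() {
    local re=$1 rest=$2
    # Unquoted $re is treated as a regex. The greedy first group grabs as
    # much as possible, so the last occurrence is found first.
    while [[ "$rest" =~ ^(.*)($re)(.*)$ ]]; do
        # [1] is the text before the match, [2] is the match itself.
        echo "${#BASH_REMATCH[1]}:${BASH_REMATCH[2]}"
        # Continue with the text before this match.
        rest=${BASH_REMATCH[1]}
    done
}

raw=$(offsets_last_first "abc" "abc-abc-abc")
ordered=$(offsets_last_first "abc" "abc-abc-abc" | tac)
echo "$raw"
echo "$ordered"
```

Here raw lists the offsets as 8, 4, 0, and piping through tac (GNU coreutils) restores the left-to-right order 0, 4, 8.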
Upvotes: 4
Reputation: 195039
If you want a 0-based index of the match, you can try this with awk:
awk '{print index($0,"amil")-1}'<<< "α-amilase"
2
awk '{print index($0,"amil")-1}'<<< "fooamilα-whatever"
3
If no match is found, it prints -1.
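Since the question asks for every match on a line, the same index() idea could be extended along these lines. This is only a sketch: all_offsets is a hypothetical name, it searches for a fixed string rather than a regex, and whether the offsets are characters or bytes depends on the awk in use (as far as I know gawk counts characters in a UTF-8 locale, while some other implementations count bytes).

```bash
#!/bin/bash
# Hypothetical helper: for each line on stdin, print the 0-based offset of
# every non-overlapping occurrence of the fixed string $1.
all_offsets() {
    awk -v needle="$1" '{
        pos = 0          # offset of "rest" within the original line
        rest = $0
        while ((i = index(rest, needle)) > 0) {
            printf "%d:%s\n", pos + i - 1, needle
            # Skip past this occurrence and keep searching.
            pos += i - 1 + length(needle)
            rest = substr(rest, i + length(needle))
        }
    }'
}

result=$(all_offsets "amil" <<< "amil-foo-amil")
echo "$result"
```

With the ASCII input above this reports offsets 0 and 9. Note that awk -v interprets backslash escapes in the value, so a needle containing backslashes would need extra care.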
Upvotes: 1