Lou

Reputation: 2509

Extracting a string from a substring in bash (yes, that way around)

I have a string of several words in bash called comp_line, which can have any number of spaces inside. For example:

"foo bar   apple  banana q xy"

And I have a zero-based index comp_point pointing to one character in that string, e.g. if comp_point is 4, it points to the first 'b' in 'bar'.

Based on comp_point and comp_line alone, I want to extract the word that the index points to, where a "word" is a sequence of letters, numbers, punctuation or any other non-whitespace characters, delimited by whitespace on either side. It should work the same way if the word is at the start or end of the string, or is the only word in the string.

The word I'm trying to extract will become cur (the current word)

Based on this, I've come up with a set of rules:

Read the current character curchar, the previous character prevchar, and the next character nextchar. Then:

  1. If curchar is a graph character (non-whitespace), set cur to curchar plus the characters before and after it, stopping at the first whitespace character or the string start/end on either side.

  2. Else, if prevchar is a graph character, set cur to the characters from prevchar backwards, up to the previous whitespace character or the string start.

  3. Else, if nextchar is a graph character, set cur to the characters from nextchar forwards, up to the next whitespace character or the string end.

  4. If none of the above conditions are hit (meaning curchar, prevchar and nextchar are all whitespace characters), set cur to "" (empty string).

I've written some code which I think achieves this. Rules 2, 3 and 4 are relatively straightforward, but rule 1 is the most difficult to implement - I've had to do some complicated string slicing. I'm not convinced that my solution is in any way ideal, and want to know if anyone knows of a better way to do this within bash only (not outsourcing to Python or another easier language.)

Tested on https://rextester.com/l/bash_online_compiler

#!/bin/bash
# GNU bash, version 4.4.20

comp_line="foo bar   apple  banana q xy"
comp_point=19
cur=""

curchar=${comp_line:$comp_point:1}
prevchar=${comp_line:$((comp_point - 1)):1}
nextchar=${comp_line:$((comp_point + 1)):1}
echo "<$prevchar> <$curchar> <$nextchar>"

if [[ $curchar =~ [[:graph:]] ]]; then
    # Rule 1 - Extract current word
    slice="${comp_line:$comp_point}"
    endslice="${slice%% *}"
    slice="${slice#"$endslice"}"
    slice="${comp_line%"$slice"}"
    cur="${slice##* }"
else
    if [[ $prevchar =~ [[:graph:]] ]]; then
        # Rule 2 - Extract previous word
        slice="${comp_line::$comp_point}"
        cur="${slice##* }"
    else
        if [[ $nextchar =~ [[:graph:]] ]]; then
            # Rule 3 - Extract next word
            slice="${comp_line:$comp_point+1}"
            cur="${slice%% *}"
        else
            # Rule 4 - Set cur to empty string ""
            cur=""
        fi
    fi
fi

echo "Cur: <$cur>"

With comp_point set to 19, which points at the first 'n' in 'banana', the current example sets cur to 'banana' and prints "Cur: <banana>".

I'm sure there must be a neater way to do it that I haven't thought of, or some trick that I've missed. It works so far, but I think there may be some edge cases I haven't considered. Can anyone advise if there's a better way to do it?
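One edge case that seems worth checking is comp_point=0. A minimal sketch, assuming the same variable names as above (prevchar_unguarded is a name made up for this illustration): the offset comp_point - 1 evaluates to -1, and bash counts a negative substring offset from the end of the string, so prevchar should pick up the last character of the line instead of being empty.

```shell
#!/bin/bash
# Sketch of the comp_point=0 edge case. The test string is made up.
comp_line=" foo bar"
comp_point=0

# Same expression as in the script above: the offset becomes -1,
# which bash treats as "one character from the end".
prevchar_unguarded=${comp_line:$((comp_point - 1)):1}

# Guarded variant: there is no previous character at index 0.
prevchar=""
if (( comp_point > 0 )); then
    prevchar=${comp_line:comp_point-1:1}
fi

echo "unguarded: <$prevchar_unguarded> guarded: <$prevchar>"
```

With the unguarded value, rule 2 would fire before rule 3 for this input, so cur would end up empty rather than 'foo'.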


(The XY problem, if anyone asks)

I'm writing a tab completion script, and trying to emulate the functionality of COMP_WORDS and COMP_CWORD, using COMP_LINE and COMP_POINT. When a user is typing a command to tab complete, I want to work out which word they are trying to tab complete just based on the latter two variables. I don't want to outsource this code to Python because performance takes a big hit when Python is involved in tab complete.
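For context, a rough sketch of where this would sit in a completion function. COMP_LINE, COMP_POINT, COMPREPLY, compgen and complete are real bash features; the command name mycmd, the function name _mycmd_complete and the candidate word list are made up for illustration. This variant treats COMP_POINT as the cursor position and joins the text on both sides of it, so it covers rules 1 and 2 but not rule 3.

```shell
#!/bin/bash
# Hypothetical completion function for a made-up command "mycmd".
_mycmd_complete() {
    # Text before the cursor, and text from the cursor onwards.
    local first=${COMP_LINE:0:COMP_POINT}
    local last=${COMP_LINE:COMP_POINT}
    # Word around the cursor: tail of "first" plus head of "last".
    local cur="${first##* }${last%% *}"
    # The candidate list is an assumption for this example.
    COMPREPLY=( $(compgen -W "apple apricot banana" -- "$cur") )
}
complete -F _mycmd_complete mycmd
```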

Upvotes: 0

Views: 253

Answers (2)

ctac_

Reputation: 2471

Another way in bash, without an array.

#!/bin/bash

string="foo bar   apple  banana q xy"

wordAtIndex() {
  local index=$1 string=$2 ret='' last first
  if [ "${string:index:1}" != " " ] ; then
    last="${string:index}"
    first="${string:0:index}"
    ret="${first##* }${last%% *}"
  fi
  echo "$ret"
}

for ((i=0; i < "${#string}"; ++i)); do
 printf '%s <-- "%s"\n' "${string:i:1}" "$(wordAtIndex "$i" "$string")"
done
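The function above returns an empty string whenever the index lands on a space, so it covers rule 1 of the question but not the fall-back rules 2 and 3. A possible extension, as a sketch only: wordNearIndex is a hypothetical name, and like the code above it treats only the space character as whitespace.

```shell
#!/bin/bash
# Sketch: rules 1-4 from the question, using only parameter expansion.
wordNearIndex() {
  local index=$1 string=$2 cur prev='' next first rest
  cur="${string:index:1}"
  next="${string:index+1:1}"
  # Guard index 0: a negative offset would count from the end of the string.
  if (( index > 0 )); then prev="${string:index-1:1}"; fi
  first="${string:0:index}"          # everything before the index
  if [[ -n $cur && $cur != " " ]]; then
    rest="${string:index}"
    echo "${first##* }${rest%% *}"   # rule 1: word under the index
  elif [[ -n $prev && $prev != " " ]]; then
    echo "${first##* }"              # rule 2: word ending just before the index
  elif [[ -n $next && $next != " " ]]; then
    rest="${string:index+1}"
    echo "${rest%% *}"               # rule 3: word starting just after the index
  else
    echo ""                          # rule 4: surrounded by spaces
  fi
}

wordNearIndex 9 "foo bar   apple  banana q xy"
```

Index 9 is the last of the three spaces before 'apple', so rule 3 applies there.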

Upvotes: 2

Socowi

Reputation: 27215

"... if anyone knows of a better way to do this within bash only"

Use regexes. With ^.{4} you can skip the first four characters to reach index 4. [[:graph:]]* then matches the rest of the word starting at that index. * is greedy, so it matches as many graph characters as possible.
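As a small illustration of that first step on its own (a sketch; the variable name re is an assumption, and the pattern is kept in a variable only to avoid quoting surprises inside [[ ]]):

```shell
#!/bin/bash
# Sketch: skip "index" characters, then capture the graph characters that follow.
string="foo bar   apple  banana q xy"
for index in 4 5; do
  re="^.{$index}([[:graph:]]*)"
  if [[ $string =~ $re ]]; then
    printf 'index %d: "%s"\n' "$index" "${BASH_REMATCH[1]}"
  fi
done
```

At index 4 the capture is the whole word, but at index 5 it is only the part of the word from that index onwards, which is why a second match anchored at the end of the string is needed to recover the left part.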

wordAtIndex() {
  local index=$1 string=$2 left right indexFromRight
  [[ "$string" =~ ^.{$index}([[:graph:]]*) ]]
  right=${BASH_REMATCH[1]}
  ((indexFromRight=${#string}-index-1))
  [[ "$string" =~ ([[:graph:]]*).{$indexFromRight}$ ]]
  left=${BASH_REMATCH[1]}
  echo "$left${right:1}"
}

And here is a full test for your example:

string="foo bar   apple  banana q xy"
for ((i=0; i < "${#string}"; ++i)); do
  printf '%s <-- "%s"\n' "${string:i:1}" "$(wordAtIndex "$i" "$string")"
done

This prints the input string vertically on the left, one character per line, and on the right the word extracted at that index.

f <-- "foo"
o <-- "foo"
o <-- "foo"
  <-- ""
b <-- "bar"
a <-- "bar"
r <-- "bar"
  <-- ""
  <-- ""
  <-- ""
a <-- "apple"
p <-- "apple"
p <-- "apple"
l <-- "apple"
e <-- "apple"
  <-- ""
  <-- ""
b <-- "banana"
a <-- "banana"
n <-- "banana"
a <-- "banana"
n <-- "banana"
a <-- "banana"
  <-- ""
q <-- "q"
  <-- ""
x <-- "xy"
y <-- "xy"

Upvotes: 1
