MarcRF

Reputation: 37

Are there Ruby methods to select a string between other strings?

I'm new to programming and want to write a program that extracts all the words found between two given words in a text, so that I can store them in a variable.

For example with the words "START" & "STOP": "START 1 2 3 STOP 5 6 START 7 8 STOP 9 10"

I would like to store in variables: 1 2 3 7 8

I started doing this in Ruby, as you can see in the code below. My idea was to split the text into an array of words, find the positions of "start" and "stop", and then build a new array from the elements at positions start + 1 through stop - 1. Unfortunately, this only works once, because .index only returns the first occurrence. Is there a better way to do this?

Thank you in advance for your help

text = "0 start 2 3 4 stop 6 7 start 9 10 stop 12"

start = text.split(' ')

a = start.index('start')
b = start.index('stop')

puts a
puts b
puts c = start[a+1,b-a-1].join(" ")

# prints:
# 1
# 5
# 2 3 4





Upvotes: 2

Views: 313

Answers (4)

iGian

Reputation: 11193

An array-based option: as a starting point, I suggest using Enumerable#slice_before after String#split.

Given your command and the start/stop words:

command = "START 1 2 3 STOP 5 6 START 7 8 STOP 9 10"

start = 'START'
stop = 'STOP'

You can use it like this:

grouped_cmd = command.split.slice_before { |e| [start, stop].include? e } # .to_a
#=> [["START", "1", "2", "3"], ["STOP", "5", "6"], ["START", "7", "8"], ["STOP", "9", "10"]]

Then you can manipulate as you like, for example:

grouped_cmd.select { |first, *rest| first == start }
#=> [["START", "1", "2", "3"], ["START", "7", "8"]]

Or

grouped_cmd.each_with_object([]) { |(first, *rest), ary| ary << rest if first == start }
#=> [["1", "2", "3"], ["7", "8"]]

Or even

grouped_cmd.each_slice(2).map { |(start, *stt), (stop, *stp)| { start.downcase.to_sym => stt, stop.downcase.to_sym => stp } }
#=> [{:start=>["1", "2", "3"], :stop=>["5", "6"]}, {:start=>["7", "8"], :stop=>["9", "10"]}]

And so on.
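
As a rough sketch, the slice_before steps above could be wrapped in one method. The name words_between and its parameters are hypothetical, chosen only for illustration, and the sketch assumes the markers appear as standalone whitespace-separated words.

```ruby
# Hypothetical helper (not part of Ruby): returns the words that follow each
# start marker, up to but not including the next marker.
def words_between(text, start_word, stop_word)
  markers = [start_word, stop_word]
  text.split
      .slice_before { |word| markers.include?(word) } # new chunk at every marker
      .select { |first, *_rest| first == start_word } # keep chunks opened by start
      .flat_map { |_marker, *rest| rest }             # drop the marker itself
end

command = "START 1 2 3 STOP 5 6 START 7 8 STOP 9 10"
p words_between(command, "START", "STOP")
#=> ["1", "2", "3", "7", "8"]
```

Note that this version does not check for a closing marker: a START with no STOP after it still contributes its words, up to the end of the input.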

Upvotes: 1

Todd A. Jacobs

Reputation: 84373

A One-Line Method Chain

Here's an approach based on String#scan:

text = "0 start 2 3 4 stop 6 7 start 9 10 stop 12"
text.scan(/\bstart\s+(.*?)\s+stop\b/i).flat_map { _1.flat_map &:split }
#=> ["2", "3", "4", "9", "10"]

The idea here is to:

  1. Extract all string segments that are bracketed between case-insensitive start and stop keywords.

    text.scan /\bstart\s+(.*?)\s+stop\b/i
    #=> [["2 3 4"], ["9 10"]]
    
  2. Extract words separated by whitespace from between your keywords.

    [["2 3 4"], ["9 10"]].flat_map { _1.flat_map &:split }
    #=> ["2", "3", "4", "9", "10"]
    

Caveats

Notable caveats to the approach outlined above include:

  • String#scan creates nested arrays, and the repeated calls to Enumerable#flat_map used to handle them are less elegant than I might prefer.
  • \b is a zero-width assertion, so looking for word boundaries can cause #scan to include leading and trailing whitespace in the results that then need to be handled by String#strip or String#split.
  • Substituting \s+ for \b handles some edge cases while creating others.
  • It doesn't do anything to guard against unbalanced pairs, e.g. "start 0 start 2 3 4 stop 6 stop".

For simple use cases, String#scan with a tuned regex is probably all you need. The more varied and unpredictable your input and data structures are, the more edge cases your parsing routines will need to handle.
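
If unbalanced pairs matter for your input, one possible alternative to a regex is a small word-by-word pass that tracks whether it is inside a span. This is only a sketch: the method name strict_words_between is hypothetical, raising ArgumentError is just one policy you might choose, and the keyword arguments are assumed to be given in lowercase.

```ruby
# Hypothetical helper: collects words only from properly closed start...stop
# spans and raises on unbalanced markers. Markers are matched
# case-insensitively; start_word and stop_word are assumed to be lowercase.
def strict_words_between(text, start_word: "start", stop_word: "stop")
  collected = []
  current = nil # nil means "not inside a span"

  text.split.each do |word|
    case word.downcase
    when start_word
      raise ArgumentError, "nested #{start_word}" if current
      current = []
    when stop_word
      raise ArgumentError, "#{stop_word} without #{start_word}" unless current
      collected.concat(current)
      current = nil
    else
      current << word if current # words outside a span are ignored
    end
  end

  raise ArgumentError, "unclosed #{start_word}" if current
  collected
end

p strict_words_between("0 start 2 3 4 stop 6 7 start 9 10 stop 12")
#=> ["2", "3", "4", "9", "10"]
```

With the unbalanced input from the caveats, "start 0 start 2 3 4 stop 6 stop", this sketch raises ArgumentError instead of returning a partial result.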

Upvotes: 1

knut

Reputation: 27855

You could start with the scan method and a regular expression:

text = "0 start 2 3 4 stop 6 7 start 9 10 stop 12"
res1 = text.scan(/start\s*(.*?)\s*stop/) #[["2 3 4"], ["9 10"]]
res2 = res1.flatten #["2 3 4", "9 10"]

or without the intermediate variables:

res = text.scan(/start\s*(.*?)\s*stop/).flatten #["2 3 4", "9 10"]

Explanation:

See https://apidock.com/ruby/String/scan for the scan method.

The regular expression /start\s*(.*?)\s*stop/ is the combination of:

  1. start: the literal text start
  2. \s*: zero or more whitespace characters
  3. (.*?):

    1. The parentheses ( and ) capture the content so that it is returned in the result.
    2. . matches any character except a newline, * repeats it zero or more times, and ? makes the repetition lazy, restricting the match to the shortest possibility (see below for details).
  4. \s*: zero or more whitespace characters
  5. stop: the literal text stop

The result is an array of matches. Because a regular expression can contain several capture groups (multiple pairs of parentheses), scan returns an array of arrays, with one inner array per match. In our case each inner array has one element, so you can use flatten to get a 'flat' array.

Without the ? in the regular expression, .* would be greedy and you would get 2 3 4 stop 6 7 start 9 10 instead of the shorter parts.
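
To see the difference, here is a short side-by-side sketch of the lazy and greedy forms on the same text (the variable names lazy and greedy are just for illustration):

```ruby
text = "0 start 2 3 4 stop 6 7 start 9 10 stop 12"

lazy   = text.scan(/start\s*(.*?)\s*stop/).flatten # shortest match per pair
greedy = text.scan(/start\s*(.*)\s*stop/).flatten  # runs to the last "stop"

p lazy   #=> ["2 3 4", "9 10"]
p greedy #=> ["2 3 4 stop 6 7 start 9 10 "]
```

The greedy capture also keeps the trailing space, because .* consumes it before \s* gets a chance to.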

Upvotes: 2

wteuber

Reputation: 1238

You are not exactly getting an error, so Code Review might be a better place to ask. But since you are new to the community, here is a regular expression with a negative lookahead assertion that does the job:

text = "0 start 2 3 4 stop 6 7 start 9 10 stop 12"
text.scan(/start ((?:(?!start).)*?) stop/).join(' ')
# => "2 3 4 9 10"
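
The lookahead only makes a difference when a second start appears before the matching stop. A small sketch on such an input (the string nested and the variable names are made up for illustration):

```ruby
nested = "start 0 start 2 3 4 stop 6 stop"

# Without the lookahead, the capture can run across the inner "start".
plain = nested.scan(/start (.*?) stop/).flatten

# (?!start) refuses to step over another "start", so only the innermost
# start...stop pair matches.
lookahead = nested.scan(/start ((?:(?!start).)*?) stop/).flatten

p plain     #=> ["0 start 2 3 4"]
p lookahead #=> ["2 3 4"]
```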

By the way, a great place to test your regular expressions in Ruby is https://rubular.com/

I hope you find this helpful.

Upvotes: 1
