Reputation: 37
I'm new to programming, and I want to write a program that extracts all the words found between two marker words in a text, so that I can store them in a variable.
For example with the words "START" & "STOP": "START 1 2 3 STOP 5 6 START 7 8 STOP 9 10"
I would like to store in variables: 1 2 3 7 8
I started in Ruby, as you can see in the code below. My idea was to split the "global" string into an array, find the positions of string1 and string2, and then build a new array from the elements at positions string1 + 1 through string2 - 1. Unfortunately, this works only once, because the .index method only finds the first occurrence. Is there a better way to do this?
Thank you in advance for your help.
text = "0 start 2 3 4 stop 6 7 start 9 10 stop 12"
start = text.split(' ')
a = start.index('start')
b = start.index('stop')
puts a
puts b
puts c = start[a+1,b-a-1].join(" ")
# returns
#1
#5
#2 3 4
Upvotes: 2
Views: 313
Reputation: 11193
An array-based option: as a starting point, I'd suggest calling Enumerable#slice_before on the result of String#split.
Given your command and the two keywords:
command = "START 1 2 3 STOP 5 6 START 7 8 STOP 9 10"
start = 'START'
stop = 'STOP'
You can use it like this:
grouped_cmd = command.split.slice_before { |e| [start, stop].include? e } # .to_a
#=> [["START", "1", "2", "3"], ["STOP", "5", "6"], ["START", "7", "8"], ["STOP", "9", "10"]]
Then you can manipulate as you like, for example:
grouped_cmd.select { |first, *rest| first == start }
#=> [["START", "1", "2", "3"], ["START", "7", "8"]]
Or
grouped_cmd.each_with_object([]) { |(first, *rest), ary| ary << rest if first == start }
#=> [["1", "2", "3"], ["7", "8"]]
Or even
grouped_cmd.each_slice(2).map { |(start, *stt), (stop, *stp)| { start.downcase.to_sym => stt, stop.downcase.to_sym => stp } }
#=> [{:start=>["1", "2", "3"], :stop=>["5", "6"]}, {:start=>["7", "8"], :stop=>["9", "10"]}]
And so on.
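If the text does not begin with a keyword, as in the question's own example, slice_before puts the leading words into a group of their own, and filtering on the first element of each group discards it. A minimal sketch, assuming Ruby 2.7+ for filter_map; the method name words_between is made up for illustration:

```ruby
# Hypothetical helper (not from the original answer): collects the words
# that follow each start keyword, up to the next start or stop keyword.
def words_between(text, start, stop)
  text.split
      .slice_before { |word| word == start || word == stop }
      .filter_map { |first, *rest| rest if first == start }
end

text = "0 start 2 3 4 stop 6 7 start 9 10 stop 12"
groups = words_between(text, "start", "stop")
p groups         #=> [["2", "3", "4"], ["9", "10"]]
p groups.flatten #=> ["2", "3", "4", "9", "10"]
```

Note that this sketch also collects the words after a start keyword that has no matching stop, so it may need an extra check if the input can be unbalanced.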
Upvotes: 1
Reputation: 84373
Here's an approach based on String#scan:
text = "0 start 2 3 4 stop 6 7 start 9 10 stop 12"
text.scan(/\bstart\s+(.*?)\s+stop\b/i).flat_map { _1.flat_map &:split }
#=> ["2", "3", "4", "9", "10"]
The idea here is to:
Extract all string segments that are bracketed between case-insensitive start and stop keywords.
text.scan /\bstart\s+(.*?)\s+stop\b/i
#=> [["2 3 4"], ["9 10"]]
Extract words separated by whitespace from between your keywords.
[["2 3 4"], ["9 10"]].flat_map { _1.flat_map &:split }
#=> ["2", "3", "4", "9", "10"]
Notable caveats to the approach outlined above include:
\b is a zero-width assertion, so looking for word boundaries can cause #scan to include leading and trailing whitespace in the results that then need to be handled by String#strip or String#split.
Substituting \s+ for \b handles some edge cases while creating others.
Nested keywords are not handled: for "start 0 start 2 3 4 stop 6 stop", the pattern above captures "0 start 2 3 4".
For simple use cases, String#scan with a tuned regex is probably all you need. The more varied and unpredictable your input and data structures are, the more edge cases your parsing routines will need to handle.
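The nested input mentioned in the caveats can be tried directly against the pattern from this answer. The sketch below also adds a second case of my own (keywords separated by a single space), since the two \s+ parts each need at least one whitespace character:

```ruby
pattern = /\bstart\s+(.*?)\s+stop\b/i

# Nested keywords: the match begins at the outer start and ends at the
# first stop, so the inner start ends up inside the capture.
nested_result = "start 0 start 2 3 4 stop 6 stop".scan(pattern)
p nested_result #=> [["0 start 2 3 4"]]

# Nothing between the keywords and only one space: there is no match,
# because one space cannot satisfy both \s+ parts.
empty_result = "start stop".scan(pattern)
p empty_result #=> []
```

Whether either result counts as a bug depends on what the input is supposed to look like.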
Upvotes: 1
Reputation: 27855
You could start with the scan method and a regular expression:
text = "0 start 2 3 4 stop 6 7 start 9 10 stop 12"
res1 = text.scan(/start\s*(.*?)\s*stop/) #[["2 3 4"], ["9 10"]]
res2 = res1.flatten #["2 3 4", "9 10"]
or without the intermediate variables:
res = text.scan(/start\s*(.*?)\s*stop/).flatten #["2 3 4", "9 10"]
Explanation:
See https://apidock.com/ruby/String/scan for the scan method.
The regular expression /start\s*(.*?)\s*stop/ is the combination of:
start: the literal text start
\s*: zero or more whitespace characters
(.*?): ( and ) capture (remember) the content; . means any character (except a newline, by default), * means a repetition (zero or more characters), and ? restricts the result to the shortest possible match (see below for details)
\s*: zero or more whitespace characters
stop: the literal text stop
The result is an array with the hits of the regular expression. The regular expression could contain several parts to capture (multiple () pairs), so the result is an array of arrays. In our case, each inner array has one element, so you can use flatten to get a 'flat' array.
If you did not use the ? in the regular expression, you would get 2 3 4 stop 6 7 start 9 10 instead of the shorter parts.
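To see the difference the ? makes, the two variants can be run side by side. This is a small sketch; the greedy result is passed through strip because the greedy (.*) also keeps the space just before the final stop:

```ruby
text = "0 start 2 3 4 stop 6 7 start 9 10 stop 12"

# Lazy quantifier: each match ends at the nearest stop.
lazy = text.scan(/start\s*(.*?)\s*stop/).flatten
p lazy   #=> ["2 3 4", "9 10"]

# Greedy quantifier: a single match that runs to the last stop.
greedy = text.scan(/start\s*(.*)\s*stop/).flatten.map(&:strip)
p greedy #=> ["2 3 4 stop 6 7 start 9 10"]
```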
Upvotes: 2
Reputation: 1238
You are not exactly getting an error, so codereview might be a better place to ask. But since you are new to the community, here is a regular expression with a negative lookahead assertion that does the job:
text = "0 start 2 3 4 stop 6 7 start 9 10 stop 12"
text.scan(/start ((?:(?!start).)*?) stop/).join(' ')
# => "2 3 4 9 10"
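The (?!start) lookahead only matters when a second start appears before the next stop. A small sketch on such an input (the nested string is an illustrative assumption, not from the question) compares the pattern with and without it:

```ruby
nested = "start 0 start 2 3 4 stop 6 stop"

# Without the lookahead, the capture begins at the outer start.
plain = nested.scan(/start (.*?) stop/)
p plain    #=> [["0 start 2 3 4"]]

# With the lookahead, the capture may not run across another start,
# so the match begins at the innermost start instead.
tempered = nested.scan(/start ((?:(?!start).)*?) stop/)
p tempered #=> [["2 3 4"]]
```

On the question's own text both patterns give the same result, because no start is repeated before a stop there.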
Btw, a great place to test your regular expressions in Ruby is https://rubular.com/
I hope you find this helpful.
Upvotes: 1