Reputation: 5450
Given this text:
/* F004 (0309)00 */ /* field 1 */ /* field 2 */ /* F004 (0409)00 */ /* field 1 */ /* field 2 */
how do I parse it into this array:
[
["F004"],["0309"],["/* field 1 */\n/* field 2 */"],
["F004"],["0409"],["/* field 1 */\n/* field 2 */"]
]
I got code working to parse the first two items:
form = /\/\*\s+(\w+)\s+\((\d{4})\)[0]{2}\s+\*\//m
text.scan(form)
[
["F004"],["0309"],
["F004"],["0409"]
]
And here's the code where I try to parse all three, which fails with an invalid regex error:
form = /\/\*\s+(\w+)\s+\((\d{4})\)[0]{2}\s+\*\//m
form_and_fields = /#{form}(.[^#{form}]+)/m
text.scan(form_and_fields)
form = /
\/\*\s+(\w+)\s+\((\d+)\)\d+\s+\*\/ #formId & edDate
(.+?) #fieldText
(?=\/\*\s+\w+\s+\(\d+\)\d+\s+\*\/|\Z) #stop at beginning of next form
# or the end of the string
/mx
text.scan(form)
Upvotes: 1
Views: 6127
Reputation: 2848
For what it's worth, you might find that your code ends up a bit more readable if you expanded it out and used multiple, simpler regexes. For example (untested):
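transformed_lines = []
text.each_line do |line|
  # A header line such as "/* F004 (0309)00 */" starts a new entry.
  if line =~ /(\w+)\s+\((\d+)\)/
    transformed_lines << [ $1, $2, "" ]
  else
    # Anything else is field text for the most recent entry.
    transformed_lines.last.last << line.strip
  end
end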
Better yet, consider creating a class or simple struct for storing the results so it's a little clearer what goes where:
require 'ostruct'
transformed_lines << OpenStruct.new(:thingy_one => $1, :thingy_two => $2, :fields => "")
...
transformed_lines.last.fields << line.strip
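As a fuller sketch of the same line-by-line idea, here is a version using a plain Struct. The names FormRecord, form_id, ed_date and fields are made up for illustration, and the sample text assumes one comment per line:

```ruby
# Hypothetical record type; the member names are illustrative only.
FormRecord = Struct.new(:form_id, :ed_date, :fields)

# Assumed input layout: one comment per line.
text = <<~TEXT
  /* F004 (0309)00 */
  /* field 1 */
  /* field 2 */
  /* F004 (0409)00 */
  /* field 1 */
  /* field 2 */
TEXT

# Matches a header comment such as "/* F004 (0309)00 */".
header = /\/\*\s+(\w+)\s+\((\d{4})\)\d+\s+\*\//

records = []
text.each_line do |line|
  if (m = header.match(line))
    # A header starts a new record with an empty field list.
    records << FormRecord.new(m[1], m[2], [])
  elsif records.any?
    # Any other line is field text for the most recent record.
    records.last.fields << line.strip
  end
end

records.each { |r| p r.to_a }
```

Keeping the fields as an array of lines leaves it up to the caller whether to join them with "\n" afterwards.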
Upvotes: 0
Reputation: 89123
You seem to be misunderstanding how character classes (e.g. [a-f0-9], or [^aeiouy]) work. /[^abcd]/ doesn't negate the pattern abcd, it says "match any character that's not 'a' or 'b' or 'c' or 'd'".
If you want to match the negation of a pattern, use the /(?!pattern)/ construct. It's a zero-width match - meaning it doesn't actually match any characters, it matches a position, similar to how /^/ and /$/ match the start and end of a line, or /\b/ matches the boundary of a word. For instance: /(?!xx)/ matches every position where the pattern "xx" doesn't start.
In general then, after you use a pattern negation, you need to match some character to move forward in the string.
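To make the contrast concrete, here is a small sketch comparing a negated character class with a negative lookahead. It also shows why the lookahead needs a following . to make progress. The sample strings are made up for illustration and are not from the question:

```ruby
# A negated character class matches ONE character outside the set.
# It never means "not the substring abcd".
outside = "abXd".scan(/[^abcd]/)

# (?!xx) is zero-width: it matches positions, not characters.
# Marking every matching position with "|" shows where "xx" does not start.
marked = "axxb".gsub(/(?!xx)/, "|")

# Pairing the lookahead with "." consumes one character per successful check,
# so each match runs until just before the next "xx".
chunks = "abxxcd".scan(/(?:(?!xx).)+/)

p outside  # ["X"]
p marked   # "|ax|x|b|"
p chunks   # ["ab", "xcd"]
```

In the last case the second chunk starts at the second x, because that is the first position after the match where "xx" no longer begins.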
So to use your pattern:
form = /\/\*\s+(\w+)\s+\((\d{4})\)[0]{2}\s+\*\//m
form_and_fields = /#{form}((?:(?!#{form}).)+)/m
text.scan(form_and_fields)
From the inside out (I'll be using (?#comments)):
(?!#{form}) negates your original pattern, so it matches any position where your original pattern can't start.
(?:(?!#{form}).)+ means match one character after that, and try again, as many times as possible, but at least once.
(?:(?#whatever)) is a non-capturing group - good for grouping.
In irb, this gives:
irb> text.scan(form_and_fields)
=> [["F004", "0309", " \n /* field 1 */ \n /* field 2 */ \n ", nil, nil], ["F004", "0409", " \n /* field 1 */ \n /* field 2 */ \n", nil, nil]]
The extra nils come from the capturing groups in form that are used in the negated pattern (?!#{form}) and therefore don't capture anything on a successful match.
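If the trailing nils are unwanted, one simple option is to keep only the first three captures of each row. This is a sketch, with sample text that assumes one comment per line, so the whitespace in the third capture differs from the irb session above:

```ruby
# Assumed input layout: one comment per line.
text = <<~TEXT
  /* F004 (0309)00 */
  /* field 1 */
  /* field 2 */
  /* F004 (0409)00 */
  /* field 1 */
  /* field 2 */
TEXT

form = /\/\*\s+(\w+)\s+\((\d{4})\)[0]{2}\s+\*\//m
form_and_fields = /#{form}((?:(?!#{form}).)+)/m

rows = text.scan(form_and_fields)
# Each row has five slots: groups 1-3 hold real captures,
# groups 4-5 sit inside the negative lookahead and stay nil.
cleaned = rows.map { |row| row.first(3) }

p cleaned
```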
This could be cleaned up some:
form_and_fields = /#{form}\s*(.+?)\s*(?:(?=#{form})|\Z)/m
text.scan(form_and_fields)
Now, instead of a zero-width negative lookahead, we use a zero-width positive lookahead (?=#{form}) to match the position of the next occurrence of form. So in this regex, we match everything until the next occurrence of form (without including that next occurrence in our match). This lets us trim out some whitespace around the fields. We also have to check for the case where we hit the end of the string, /\Z/, since that could happen too.
In irb:
irb> text.scan(form_and_fields)
=> [["F004", "0309", "/* field 1 */ \n /* field 2 */", "F004", "0409"], ["F004", "0409", "/* field 1 */ \n /* field 2 */", nil, nil]]
Note that the last two fields are now populated in the first match, because the capturing parens in the zero-width positive lookahead matched something, even though that text wasn't consumed in the process, which is why it could be matched again as the start of the second match.
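One way to avoid the extra captures altogether is to give the lookahead its own copy of the pattern without capturing groups. In this sketch next_form is a name introduced here for illustration, and the sample text again assumes one comment per line:

```ruby
# Assumed input layout: one comment per line.
text = <<~TEXT
  /* F004 (0309)00 */
  /* field 1 */
  /* field 2 */
  /* F004 (0409)00 */
  /* field 1 */
  /* field 2 */
TEXT

# Header pattern with the two captures we want to keep.
form = /\/\*\s+(\w+)\s+\((\d{4})\)00\s+\*\//
# Same shape without capture groups, used only inside the lookahead.
next_form = /\/\*\s+\w+\s+\(\d{4}\)00\s+\*\//

# Lazily take the field text up to the next header or the end of the string.
form_and_fields = /#{form}\s*(.+?)\s*(?=#{next_form}|\z)/m

result = text.scan(form_and_fields)
p result
```

Each row then has exactly three entries, which is close to the shape the question asks for.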
Upvotes: 6
Reputation: 11262
a.scan(/\/\*\s+(\S+)\s+\((\d+)\)\d+\s+\*\/\s+(\/\*.+\*\/\s+\n\s+\/\*.+\*\/)/)
=> [["F004", "0309", "/* field 1 */ \n /* field 2 */"], ["F004", "0409", "/* field 1 */ \n /* field 2 */"]]
Upvotes: 2