Reputation: 3112
I have street names and numbers in a file, like so:
Sokolov 19, 20, 23 ,25
Hertzl 80,82,84,86
Hertzl 80a,82b,84e,90
Aba Hillel Silver 2,3,5,6,
Weizman 8
Ahad Ha'am 9 13 29
I parse the lines one by one with regex. I want a regex that will find and match:
I've come up with this mean while:
/(\D{2,})\s+(\d{1,3}[a-d|א-ד]?)(?:[,\s]{1,3})?/
It finds the street name and first number. I need to find all the numbers.
I don't want to use two separate regex's if possible, and I prefer not to use Ruby's scan
but just have it in one regex.
Upvotes: 0
Views: 325
Reputation: 70750
I want a regex that will find and match....
digits (0-9)
, other characters
beside an apostrophe?a
, b
, c
, or d
?Here are some possible options:
If you are unsure about what the street name contains, but know your street number pattern will be numbers with an optional letter, commas or spaces.
/^(.*?)\s+(\d+(?:[a-z]?[, ]+\d+)*)(?=,|$)/
See working demo
If the street names contain only letters with optional apostrophe's and the street numbers contain numbers with an optional letter, comma.
/^([a-zA-Z' ]+)\s+(\d+(?:[a-z]?[, ]+\d+)*)(?=,|$)/
See working demo
If your street name and street number pattern are always consistant, you could easily do.
/^([a-zA-Z' ]+)\s+([0-9a-z, ]+)$/
See working demo
Upvotes: 1
Reputation: 29439
The only way around the limitation that you can only capture the last instance of a repeated expression is to write your regex for a single instance and let the regex machine do the repeating for you, as occurs with the global substitute options, admittedly similar to scan. Unfortunately, in that case, you have to match for either the street name or the street number and then have no way to easily associate the captured numbers with the captured names.
Regex is great at what it does, but when you try to extend its application beyond it's natural limitations, it's not pretty. ;-)
Upvotes: 1
Reputation: 303550
You can use regex to find all the numbers, with their separators:
re = /\A(.+?)\s+((?:\d+[a-z]*[,\s]+)*\d+[a-z]*)/
txt = "Sokolov 19, 20, 23 ,25
Hertzl 80,82,84,86
Hertzl 80a,82b,84e,90
Aba Hillel Silver 2,3,5,6,
Weizman 8
Ahad Ha'am 9 13 29"
matches = txt.lines.map{ |line| line.match(re).to_a[1..-1] }
p matches
#=> [["Sokolov", "19, 20, 23 ,25"],
#=> ["Hertzl", "80,82,84,86"],
#=> ["Hertzl", "80a,82b,84e,90"],
#=> ["Aba Hillel Silver", "2,3,5,6"],
#=> ["Weizman", "8"],
#=> ["Ahad Ha'am", "9 13 29"]]
The above regex says:
\A
Starting at the front of the string(…)
Capture the result
.+?
Find one or more characters, as few as possible that make the rest of this pattern match.\s+
Followed by one or more whitespace characters (which we don't capture)(…)
Capture the result
(?:…)*
Find zero or more of what's in here, but don't capture them\d+
One or more digits (0–9)[a-z]*
Zero or more lowercase letters[,\s]+
One or more commas and/or whitespace characters\d+
Followed by one or more digits[a-z]*
And zero or more lowercase lettersHowever, if you want to break the number up into pieces you will need to use scan
or split
or the equivalent.
result = matches.map{ |name,numbers| [name,numbers.scan(/[^,\s]+/)] }
p result
#=> [["Sokolov", ["19", "20", "23", "25"]],
#=> ["Hertzl", ["80", "82", "84", "86"]],
#=> ["Hertzl", ["80a", "82b", "84e", "90"]],
#=> ["Aba Hillel Silver", ["2", "3", "5", "6"]],
#=> ["Weizman", ["8"]],
#=> ["Ahad Ha'am", ["9", "13", "29"]]]
This is because regex captures inside a repeating group do not capture each repetition. For example:
re = /((\d+) )+/
txt = "hello 11 2 3 44 5 6 77 world"
p txt.match(re)
#=> #<MatchData "11 2 3 44 5 6 77 " 1:"77 " 2:"77">
The whole regex matches the whole string, but each capture only saves the last-seen instance. In this case, the outer capture only gets "77 " and the inner capture only gets "77".
Why do you prefer not to use scan
? This is what it is made for.
Upvotes: 3
Reputation: 4010
If you want your 3rd example to work, you need to have the [a-d]
change to include the e
in the range. After changing that you can use (\D{2,})\s+(\d{1,3}[a-e]?(?:[,\s]{1,3})*)*
. Using the examples you gave I did some testing using Rubular.
Using some more groupings you can have the repetition on those last few conditions (which seem to be pretty tricky. This way the spacing and comma at the end will get caught in the repetition after consuming the space initially.
Upvotes: 1