
ruby regex

mjnissim

Reputation: 3112

Matching repeated pattern in string

I have street names and numbers in a file, like so:

Sokolov 19, 20, 23 ,25
Hertzl 80,82,84,86
Hertzl 80a,82b,84e,90
Aba Hillel Silver 2,3,5,6,
Weizman 8
Ahad Ha'am 9 13 29

I parse the lines one by one with regex. I want a regex that will find and match:

The name of the street,
The street numbers, each with an optional a, b, c, or d suffix.

Meanwhile, I've come up with this:

/(\D{2,})\s+(\d{1,3}[a-d|א-ד]?)(?:[,\s]{1,3})?/

It finds the street name and first number. I need to find all the numbers.

I don't want to use two separate regexes if possible, and I'd prefer not to use Ruby's scan but to do it all in one regex.

Upvotes: 0

Answers (4)

hwnd

Reputation: 70750

I want a regex that will find and match....

Do the street names also contain digits (0-9), or characters other than an apostrophe?
Are the street numbers based on arbitrary data? Is the suffix always just an optional a, b, c, or d?
Do you need a minimum or maximum string length?

Here are some possible options:

If you are unsure what the street name contains, but know that the street numbers are digits with an optional letter, separated by commas or spaces:

/^(.*?)\s+(\d+(?:[a-z]?[, ]+\d+)*)(?=,|$)/

See working demo

If the street names contain only letters with optional apostrophes, and the street numbers are digits with an optional letter, separated by commas or spaces:

/^([a-zA-Z' ]+)\s+(\d+(?:[a-z]?[, ]+\d+)*)(?=,|$)/

See working demo

If your street name and street number patterns are always consistent, you could simply use:

/^([a-zA-Z' ]+)\s+([0-9a-z, ]+)$/

See working demo
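
As a quick check, the third pattern can be tried against a few of the sample lines from the question. This is only a minimal sketch: the variable names are illustrative, and `&.` assumes Ruby 2.3 or later. Note that this pattern keeps a trailing comma in the numbers capture.

```ruby
# Third pattern from above, applied line by line to the question's sample data.
re = /^([a-zA-Z' ]+)\s+([0-9a-z, ]+)$/

lines = [
  "Sokolov 19, 20, 23 ,25",
  "Hertzl 80a,82b,84e,90",
  "Aba Hillel Silver 2,3,5,6,",
  "Ahad Ha'am 9 13 29"
]

# captures returns [name, numbers]; &. yields nil for a line that does not match.
pairs = lines.map { |line| line.match(re)&.captures }
pairs.each { |pair| p pair }
#=> ["Sokolov", "19, 20, 23 ,25"]
#=> ["Hertzl", "80a,82b,84e,90"]
#=> ["Aba Hillel Silver", "2,3,5,6,"]
#=> ["Ahad Ha'am", "9 13 29"]
```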

Upvotes: 1

Peter Alfvin

Reputation: 29439

A capture group inside a repeated expression keeps only the last instance it matched. The only way around that limitation is to write your regex for a single instance and let the regex engine do the repeating for you, as happens with global substitution (gsub in Ruby), which is admittedly similar to scan. Unfortunately, in that case you have to match either the street name or the street number, and you then have no easy way to associate the captured numbers with the captured names.

Regex is great at what it does, but when you try to extend its application beyond its natural limitations, it's not pretty. ;-)
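
The single-instance approach described above might look like the following sketch. The patterns and variable names are illustrative assumptions rather than anything from the question, and it does rely on scan to do the repeating. The street name has to be pulled out by a separate expression, which is the association problem mentioned above.

```ruby
line = "Hertzl 80a,82b,84e,90"

# Pattern for ONE street number: digits plus an optional lowercase letter.
one_number = /\d+[a-z]?/

# scan repeats the single-instance pattern across the whole line.
numbers = line.scan(one_number)
p numbers  #=> ["80a", "82b", "84e", "90"]

# The name needs its own expression: the shortest run of non-digits
# that is followed by whitespace and then a digit.
name = line[/\A\D+?(?=\s+\d)/]
p name     #=> "Hertzl"
```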

Upvotes: 1

Phrogz

Reputation: 303550

You can use regex to find all the numbers, with their separators:

re = /\A(.+?)\s+((?:\d+[a-z]*[,\s]+)*\d+[a-z]*)/

txt = "Sokolov 19, 20, 23 ,25
Hertzl 80,82,84,86
Hertzl 80a,82b,84e,90
Aba Hillel Silver 2,3,5,6,
Weizman 8
Ahad Ha'am 9 13 29"

matches = txt.lines.map{ |line| line.match(re).to_a[1..-1] }
p matches
#=> [["Sokolov", "19, 20, 23 ,25"],
#=>  ["Hertzl", "80,82,84,86"],
#=>  ["Hertzl", "80a,82b,84e,90"],
#=>  ["Aba Hillel Silver", "2,3,5,6"],
#=>  ["Weizman", "8"],
#=>  ["Ahad Ha'am", "9 13 29"]]

The above regex says:

\A Starting at the front of the string
(…) Capture the result
  - .+? Find one or more characters, as few as possible, that make the rest of this pattern match
\s+ Followed by one or more whitespace characters (which we don't capture)
(…) Capture the result
  - (?:…)* Find zero or more of what's in here, but don't capture them
    - \d+ One or more digits (0–9)
    - [a-z]* Zero or more lowercase letters
    - [,\s]+ One or more commas and/or whitespace characters
  - \d+ Followed by one or more digits
  - [a-z]* And zero or more lowercase letters
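
The same breakdown can be kept next to the pattern itself by using Ruby's extended mode (the `x` flag), which ignores whitespace and allows `#` comments inside the regex. This is only a sketch of the same pattern laid out that way; the sample line and the variable name `m` are illustrative.

```ruby
re = /
  \A                       # start of the string
  (.+?)                    # capture 1: street name, as few characters as possible
  \s+                      # whitespace between name and numbers, not captured
  (                        # capture 2: the whole run of numbers
    (?:\d+[a-z]*[,\s]+)*   #   zero or more numbers, each followed by separators
    \d+[a-z]*              #   the final number, with optional letters
  )
/x

m = "Aba Hillel Silver 2,3,5,6,".match(re)
p m[1]  #=> "Aba Hillel Silver"
p m[2]  #=> "2,3,5,6"
```

The trailing comma is left out of the second capture because the pattern has to end on a number.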

However, if you want to break the numbers up into individual pieces, you will need to use scan, split, or the equivalent.

result = matches.map{ |name,numbers| [name,numbers.scan(/[^,\s]+/)] }
p result
#=> [["Sokolov", ["19", "20", "23", "25"]],
#=>  ["Hertzl", ["80", "82", "84", "86"]],
#=>  ["Hertzl", ["80a", "82b", "84e", "90"]],
#=>  ["Aba Hillel Silver", ["2", "3", "5", "6"]],
#=>  ["Weizman", ["8"]],
#=>  ["Ahad Ha'am", ["9", "13", "29"]]]

This is because regex captures inside a repeating group do not capture each repetition. For example:

re = /((\d+) )+/
txt = "hello 11 2 3 44 5 6 77 world"

p txt.match(re)
#=> #<MatchData "11 2 3 44 5 6 77 " 1:"77 " 2:"77">

The overall match covers every repetition, but each capture group only saves the last-seen instance. In this case, the outer capture only gets "77 " and the inner capture only gets "77".
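
For comparison, here is a small sketch of the same input handled both ways: match with a repeated group, and scan with a single-instance pattern. The variable names are illustrative.

```ruby
txt = "hello 11 2 3 44 5 6 77 world"

# Repeated group: the match spans all the numbers, but group 2 keeps only the last one.
m = txt.match(/((\d+) )+/)
p m[0]  #=> "11 2 3 44 5 6 77 "
p m[2]  #=> "77"

# Single-instance pattern with scan: every repetition is returned.
# scan yields one array of captures per match, hence the flatten.
all_numbers = txt.scan(/(\d+) /).flatten
p all_numbers  #=> ["11", "2", "3", "44", "5", "6", "77"]
```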

Why do you prefer not to use scan? This is what it is made for.

Upvotes: 3

Walls

Reputation: 4010

For your third example to match, the [a-d] range needs to be widened to include e. After changing that you can use (\D{2,})\s+(\d{1,3}[a-e]?(?:[,\s]{1,3})*)*. I tested this against the examples you gave using Rubular.

With some extra grouping you can repeat those last few conditions (which seem to be pretty tricky). This way, after the initial whitespace is consumed, the spacing and the trailing comma get caught in the repetition.
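
A quick sketch of how this pattern behaves on one of the sample lines (the variable names are illustrative): the overall match covers the whole line, but the repeated second group holds only the last number, so the individual numbers would still need to be split out of the matched text.

```ruby
re = /(\D{2,})\s+(\d{1,3}[a-e]?(?:[,\s]{1,3})*)*/

m = "Hertzl 80a,82b,84e,90".match(re)
p m[0]  #=> "Hertzl 80a,82b,84e,90"
p m[1]  #=> "Hertzl"
p m[2]  #=> "90"
```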

Upvotes: 1
