Duncan Malashock

Reputation: 786

Ruby regex to match substrings between character patterns and newlines

I have data that's formatted this way, as a single string:

"1. Enloe Medical Center - 2,000 
2. CSU Chico - 1,805 
3. Walmart Distribution Center - 1,350 
4. Pacific Coast Producers (Agribusiness) - 1,200 
5. Marysville School District - 1,000 
6. Feather River Hospital - 865 
7. Sunsweet Growers (Agriculture) - 600 
8. YRC (Freight Services) - 500 
9. Sierra Pacific Industries (Lumber Products) - 500 
10. Colusa Casino Resort - 500"

In a Ruby app, I'd like to create two arrays: one of the substrings between each numbered list marker and the dash, and one of the substrings containing the numbers between the dash and the newlines (as integers), like so:

labels = ["Enloe Medical Center","CSU Chico","Walmart Distribution Center","Pacific Coast Producers (Agribusiness)","Marysville School District","Feather River Hospital","Sunsweet Growers (Agriculture)","YRC (Freight Services)","Sierra Pacific Industries (Lumber Products)","Colusa Casino Resort"]

numbers = [2000, 1805, 1350, 1200, 1000, 865, 600, 500, 500, 500]

I'm not so great with my regexes; I know how to do substitutions and matching, but I'm not sure where to start with this. Can anyone help?

Upvotes: 0

Views: 231

Answers (5)

sawa

Reputation: 168081

labels, numbers = string.scan(/^\s*\d+\.\s+(.+)\s+-\s+([\d,]+)\s*$/).transpose
numbers.map!{|s| s.gsub(",", "").to_i}

Upvotes: 3

Darek Nędza

Reputation: 1420

Two things make this easy. The first is the m flag:

/pat/m - treat a newline as a character matched by .

The second is grouping (see the example in the second part).

You write a regexp for one line, and scan applies it across the whole string:

r1 = /\d+\,\d+\s*$/m
str.scan r1
["2,000 ", "1,805 ", "1,350 ", "1,200 ", "1,000 "]

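$ - matches the end of a line
\d - a digit
+ - one or more times
\s* - whitespace, zero or more times
Note that this pattern only matches numbers that contain a comma, so 865, 600 and 500 are not returned.
P.S. Since you know how to substitute, I haven't converted the results to integers.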
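A minimal sketch of that last step, with the comma made optional so that numbers below 1,000 are matched too. It assumes the data is in a variable named str, as in this answer; the three-line sample is shortened from the question for illustration.

```ruby
# Shortened sample in the question's format (trailing spaces included).
str = "1. Enloe Medical Center - 2,000 \n" \
      "6. Feather River Hospital - 865 \n" \
      "10. Colusa Casino Resort - 500"

# [\d,]+ : a run of digits and commas, so "865" matches as well as "2,000"
# \s*$   : optional trailing whitespace, then the end of a line
raw = str.scan(/[\d,]+\s*$/)

# Drop the commas, then convert; to_i ignores the trailing space.
numbers = raw.map { |s| s.delete(",").to_i }
p numbers  # => [2000, 865, 500]
```

The list markers ("1.", "10.") are not picked up because they are followed by a dot rather than by the end of the line.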

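r2 = /\d+\.\s*([\w\s]+)\s*\-/m
str.scan(r2).flatten

\d+ - matches a digit, one or more times
\. - matches a literal . (you must escape it, because an unescaped . matches any character)
\s* - whitespace, zero or more times
[\w\s]+ - any word character or whitespace, one or more times
() - a capturing group; it's an easy way to say "I want this part, surrounded by that". More here: regexp ruby - capturing
Note that [\w\s] does not include parentheses, so labels such as "Pacific Coast Producers (Agribusiness)" are skipped, and each captured label keeps its trailing space.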
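A sketch of a variant that also handles labels containing parentheses, by capturing lazily up to the " - " separator instead of listing the allowed characters. This is one possible adjustment rather than the pattern from this answer; str and the shortened sample are assumed for illustration.

```ruby
# Shortened sample in the question's format.
str = "1. Enloe Medical Center - 2,000 \n" \
      "4. Pacific Coast Producers (Agribusiness) - 1,200 \n" \
      "10. Colusa Casino Resort - 500"

# ^\d+\.\s+ : the list marker at the start of a line, e.g. "4. "
# (.+?)     : capture the label, taking as few characters as possible
# \s+-\s    : stop at the " - " separator (not part of the capture)
labels = str.scan(/^\d+\.\s+(.+?)\s+-\s/).flatten
p labels
# => ["Enloe Medical Center", "Pacific Coast Producers (Agribusiness)", "Colusa Casino Resort"]
```

Because scan returns one array per match when the pattern has a group, flatten turns the result into a plain array of strings.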

Upvotes: 1

Casimir et Hippolyte

Reputation: 89547

You can do this:

rawlines = <<EOF
1. Enloe Medical Center - 2,000 
2. CSU Chico - 1,805 
3. Walmart Distribution Center - 1,350 
4. Pacific Coast Producers (Agribusiness) - 1,200 
5. Marysville School District - 1,000 
6. Feather River Hospital - 865 
7. Sunsweet Growers (Agriculture) - 600 
8. YRC (Freight Services) - 500 
9. Sierra Pacific Industries (Lumber Products) - 500 
10. Colusa Casino Resort - 500
EOF
labels = []
numbers = []
rawlines.scan(/^[0-9]+\. ([^-]+) - ([1-9][0-9]{0,2}(?>,[0-9]{3})*)/) do |label, number|
  labels << label
  numbers << number.gsub(",", "").to_i
end
puts labels
puts numbers

Note that this part of the pattern ([1-9][0-9]{0,2}(?>,[0-9]{3})*) can be replaced by ([0-9,]+)
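The two number patterns behave the same on the data in the question; they only differ on malformed input. A small sketch of that difference follows. The \A and \z anchors and the sample strings are added here for illustration and are not part of the pattern above.

```ruby
# Strict: a leading non-zero digit, up to two more digits,
# then any number of ",ddd" groups.
strict = /\A[1-9][0-9]{0,2}(?>,[0-9]{3})*\z/
# Loose: any run of digits and commas.
loose = /\A[0-9,]+\z/

# Two well-formed numbers followed by two malformed ones (made-up samples).
samples = ["2,000", "865", "1,23", ",,5"]

strict_ok = samples.select { |s| strict.match?(s) }
loose_ok  = samples.select { |s| loose.match?(s) }

p strict_ok  # => ["2,000", "865"]
p loose_ok   # => ["2,000", "865", "1,23", ",,5"]
```

If the input is known to be well formed, the shorter pattern is enough; the strict one is useful when the match doubles as validation.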

Upvotes: 0

bjhaid

Reputation: 9752

str = %{1. Enloe Medical Center - 2,000
2. CSU Chico - 1,805
3. Walmart Distribution Center - 1,350
4. Pacific Coast Producers (Agribusiness) - 1,200
5. Marysville School District - 1,000
6. Feather River Hospital - 865
7. Sunsweet Growers (Agriculture) - 600
8. YRC (Freight Services) - 500
9. Sierra Pacific Industries (Lumber Products) - 500
10. Colusa Casino Resort - 500}

numbers = str.scan(/-\ (\d.*)$/).flatten.map{|s| s.gsub(",", "").to_i} # => [2000, 1805, 1350, 1200, 1000, 865, 600, 500, 500, 500]
labels = str.scan(/\d+\.\s(.*)\s-/).flatten # => ["Enloe Medical Center", "CSU Chico", "Walmart Distribution Center", "Pacific Coast Producers (Agribusiness)", "Marysville School District", "Feather River Hospital", "Sunsweet Growers (Agriculture)", "YRC (Freight Services)", "Sierra Pacific Industries (Lumber Products)", "Colusa Casino Resort"]

Upvotes: 0

devanand

Reputation: 5290

s = "1. Enloe Medical Center - 2,000 
 2. CSU Chico - 1,805 
 3. Walmart Distribution Center - 1,350 
 4. Pacific Coast Producers (Agribusiness) - 1,200 
 5. Marysville School District - 1,000 
 6. Feather River Hospital - 865 
 7. Sunsweet Growers (Agriculture) - 600 
 8. YRC (Freight Services) - 500 
 9. Sierra Pacific Industries (Lumber Products) - 500 
10. Colusa Casino Resort - 500"

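# Numbers: take everything after "- ", strip non-digits, convert to Integer.
arr1 = s.each_line.map { |x|
  x.match(/- (.*)/)[1].gsub(/[^0-9]/, '').to_i
}

# Labels: the text between the list marker ("1. ") and " - ".
arr2 = s.each_line.map { |x|
  x.match(/\d+\. (.*) - (.*)/)[1]
}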

puts arr1
puts arr2

Upvotes: 0
