Reputation: 4907
I'm a little bit rusty with regex and have the following problem. Below is the text I am trying to split.
INVOICE # 2599
INVOICE 0185570
INVOICE: 1739
INVOICE- 45441
INVOICE:# 1234
INVOICE :# 5678
What I need is exactly two matches per line. For example, I would like to get the following:
[INVOICE#, 2599]
[INVOICE, 0185570]
[INVOICE:, 1739]
[INVOICE-, 45441]
[INVOICE:#, 1234]
[INVOICE:#, 5678]
So far I'm having trouble with the characters : and #, and anything else that can appear between INVOICE and the number.
The digits are easy: all I need is (\d+). But how do I get the first part? I know I need (\w+), but then the non-word characters throw me off. Can I get a push in the right direction, please?
Upvotes: 1
Views: 56
Reputation: 4709
This can solve the problem: delete(' ').scan(/\d+|\D+/)
lines = ['INVOICE # 2599', 'INVOICE 0185570', 'INVOICE: 1739', 'INVOICE- 45441', 'INVOICE:# 1234', 'INVOICE :# 5678']
lines.map{ |line| line.delete(' ').scan(/\d+|\D+/) }
output:
[
[0] [
[0] "INVOICE#",
[1] "2599"
],
[1] [
[0] "INVOICE",
[1] "0185570"
],
[2] [
[0] "INVOICE:",
[1] "1739"
],
[3] [
[0] "INVOICE-",
[1] "45441"
],
[4] [
[0] "INVOICE:#",
[1] "1234"
],
[5] [
[0] "INVOICE:#",
[1] "5678"
]
]
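To make the two steps easier to follow, here is a minimal sketch that wraps them in a helper. The method name split_invoice and the second input are hypothetical and only serve as illustration. The second call shows a limitation worth knowing: because the pattern alternates between runs of digits and runs of non-digits, a label that itself contains a digit is split into more than two pieces.

```ruby
# Hypothetical helper wrapping the two calls from the answer above.
def split_invoice(line)
  compact = line.delete(' ')   # 'INVOICE :# 5678' becomes 'INVOICE:#5678'
  compact.scan(/\d+|\D+/)      # alternating runs of digits and non-digits
end

p split_invoice('INVOICE :# 5678')  #=> ["INVOICE:#", "5678"]

# Hypothetical input: a digit inside the label yields four pieces, not two.
p split_invoice('INV2 # 5')         #=> ["INV", "2", "#", "5"]
```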
Upvotes: 1
Reputation: 110685
text =<<~END
INVOICE # 2599
INVOICE 0185570
INVOICE: 1739
INVOICE- 45441
INVOICE:# 1234
INVOICE :# 5678
END
text.each_line.map { |s| s.gsub(/\s+(?!\d)/,'').split }
#=> [["INVOICE#", "2599"], ["INVOICE", "0185570"], ["INVOICE:", "1739"],
# ["INVOICE-", "45441"], ["INVOICE:#", "1234"], ["INVOICE:#", "5678"]]
The regular expression used by gsub reads, "match one or more whitespaces not followed by a digit", (?!\d) being a negative lookahead. That is slightly different from s.gsub(/\s+(?=\D)/,''), "match one or more whitespaces followed by a non-digit", as the former removes the newline at the end of each line whereas the latter does not: at the end of the string there is no character left for (?=\D) to match.
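The difference between the two lookaheads can be seen on a single line. This is a small sketch; the two method names are hypothetical and exist only to label the variants.

```ruby
# Negative lookahead: whitespace NOT followed by a digit.
# The trailing "\n" is followed by nothing, so it qualifies and is removed.
def strip_negative(s)
  s.gsub(/\s+(?!\d)/, '')
end

# Positive lookahead: whitespace followed by a non-digit character.
# The trailing "\n" has no following character, so it is kept.
def strip_positive(s)
  s.gsub(/\s+(?=\D)/, '')
end

line = "INVOICE # 2599\n"
p strip_negative(line)        #=> "INVOICE# 2599"
p strip_positive(line)        #=> "INVOICE# 2599\n"

# String#split with no argument ignores the trailing newline,
# so both variants give the same pair here.
p strip_positive(line).split  #=> ["INVOICE#", "2599"]
```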
The steps are as follows:
enum1 = text.each_line
#=> #<Enumerator: "INVOICE # 2599\nINVOICE 0185570\nINVOICE: 1739\n
# INVOICE- 45441\nINVOICE:# 1234\nINVOICE :# 5678\n":each_line>
I've used String#each_line rather than String#lines (or other ways to create an array of lines) to avoid the creation of a temporary array.
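As a quick illustration of that point, the sketch below compares what the two methods return. The constant SAMPLE is a shortened, hypothetical stand-in for text.

```ruby
# Shortened stand-in for the text used above.
SAMPLE = "INVOICE # 2599\nINVOICE 0185570\n"

# Without a block, each_line returns a lazy-style Enumerator over the lines.
p SAMPLE.each_line.class  #=> Enumerator

# lines builds the whole array of lines up front.
p SAMPLE.lines.class      #=> Array

# Mapping over either one yields the same result.
p SAMPLE.each_line.map(&:split) == SAMPLE.lines.map(&:split)  #=> true
```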
enum2 = enum1.map
#=> #<Enumerator: #<Enumerator: "INVOICE # 2599\nINVOICE 0185570\nINVOICE: 1739\n
# INVOICE- 45441\nINVOICE:# 1234\nINVOICE :# 5678\n":each_line>:map>
s = enum2.next
#=> "INVOICE # 2599\n"
t = s.gsub(/\s+(?!\d)/,'')
#=> "INVOICE# 2599"
t.split
#=> ["INVOICE#", "2599"]
s = enum2.next
#=> "INVOICE 0185570\n"
t = s.gsub(/\s+(?!\d)/,'')
#=> "INVOICE 0185570"
t.split
#=> ["INVOICE", "0185570"]
and so on.
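Another way of doing this is to remove the unwanted spaces before the string is divided into lines. Here the pattern must match only spaces and tabs rather than \s, because \s also matches the newline between two lines (which is followed by the I of INVOICE, not by a digit), and removing it would merge the lines:
text.gsub(/[ \t]+(?!\d)/, '').each_line.map(&:split)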
#=> [["INVOICE#", "2599"], ["INVOICE", "0185570"], ["INVOICE:", "1739"],
# ["INVOICE-", "45441"], ["INVOICE:#", "1234"], ["INVOICE:#", "5678"]]
Upvotes: 0
Reputation: 21130
You can use \D to match a non-digit. Capture both the word and the non-digits in the first group and the digits in the second group, then remove the spaces in the first capture group. Here is an example of how it might look:
text.scan(/(\w+\D+)(\d+)/).each { |group_1,| group_1.delete!(' ') }
#=> [["INVOICE#", "2599"], ["INVOICE", "0185570"], ["INVOICE:", "1739"], ["INVOICE-", "45441"], ["INVOICE:#", "1234"], ["INVOICE:#", "5678"]]
You could also use gsub! or tr! instead of delete!. Replacing \D with \W (non-word character) would also work.
Keep in mind that \w equals [A-Za-z0-9_], so it can also match digits and underscores.
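To show why that matters, here is a small sketch with a hypothetical input in which a number comes before the label. The method names are made up for illustration, and the [A-Za-z]+ variant is one possible tightening under the assumption that the label always starts with ASCII letters; it is not part of the answer above.

```ruby
# Pattern from the answer: \w+ may start on digits.
def scan_with_word_class(line)
  line.scan(/(\w+\D+)(\d+)/)
end

# Assumed alternative: require the label to start with ASCII letters.
def scan_with_letters_only(line)
  line.scan(/([A-Za-z]+\D+)(\d+)/)
end

line = "2599 INVOICE 1234"  # hypothetical input

# \w+ consumes "2599", \D+ consumes " INVOICE ", \d+ consumes "1234".
p scan_with_word_class(line)    #=> [["2599 INVOICE ", "1234"]]

# The leading number is skipped because it cannot start the first group.
p scan_with_letters_only(line)  #=> [["INVOICE ", "1234"]]
```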
Upvotes: 1