Dan Rubio

Reputation: 4907

How to refine the separation of characters from digits in a regex

I'm a little rusty with regex and have the following problem. Below is the text I am trying to split.

INVOICE # 2599
INVOICE 0185570
INVOICE: 1739
INVOICE- 45441
INVOICE:# 1234
INVOICE :# 5678

What I need is exactly two matches per line. For example, I would like to get the following:

[INVOICE#, 2599]
[INVOICE, 0185570]
[INVOICE:, 1739]
[INVOICE-, 45441]
[INVOICE:#, 1234]
[INVOICE:#, 5678]

So far I'm getting into trouble with the characters : and #, and anything else that can appear between INVOICE and the number.

The digits are easy: all I need is (\d+). But how do I get the first part? I know I need (\w+), but then the non-word characters throw me off. Can I get a push in the right direction, please?

Upvotes: 1

Views: 56

Answers (3)

demir

Reputation: 4709

This can solve the problem: delete(' ').scan(/\d+|\D+/)

lines = ['INVOICE # 2599', 'INVOICE 0185570', 'INVOICE: 1739', 'INVOICE- 45441', 'INVOICE:# 1234', 'INVOICE :# 5678']
lines.map{ |line| line.delete(' ').scan(/\d+|\D+/) }

output:

[
    [0] [
        [0] "INVOICE#",
        [1] "2599"
    ],
    [1] [
        [0] "INVOICE",
        [1] "0185570"
    ],
    [2] [
        [0] "INVOICE:",
        [1] "1739"
    ],
    [3] [
        [0] "INVOICE-",
        [1] "45441"
    ],
    [4] [
        [0] "INVOICE:#",
        [1] "1234"
    ],
    [5] [
        [0] "INVOICE:#",
        [1] "5678"
    ]
]
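
As a rough illustration of what the two calls do on a single line, here is a minimal sketch. The variable names and the input "INV2 # 2599" are hypothetical and are not part of the question's data; they are only there to show one limitation of this approach.

```ruby
line = 'INVOICE :# 5678'

# Step 1: drop every space so the prefix and the number sit side by side.
stripped = line.delete(' ')

# Step 2: cut the string into alternating runs of non-digits and digits.
parts = stripped.scan(/\d+|\D+/)

p stripped  # "INVOICE:#5678"
p parts     # ["INVOICE:#", "5678"]

# Hypothetical input whose prefix itself contains a digit: the prefix is
# split as well, so more than two pieces come back.
with_digit_in_prefix = 'INV2 # 2599'.delete(' ').scan(/\d+|\D+/)
p with_digit_in_prefix  # ["INV", "2", "#", "2599"]
```

For the six lines in the question the prefix never contains a digit, so each line yields exactly two elements.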

Upvotes: 1

Cary Swoveland

Reputation: 110685

text =<<~END
INVOICE # 2599
INVOICE 0185570
INVOICE: 1739
INVOICE- 45441
INVOICE:# 1234
INVOICE :# 5678
END

text.each_line.map { |s| s.gsub(/\s+(?!\d)/,'').split }
  #=> [["INVOICE#", "2599"], ["INVOICE", "0185570"], ["INVOICE:", "1739"],
  #    ["INVOICE-", "45441"], ["INVOICE:#", "1234"], ["INVOICE:#", "5678"]]  

The regular expression used by gsub reads, "match one or more whitespace characters not followed by a digit", (?!\d) being a negative lookahead. That is slightly different from s.gsub(/\s+(?=\D)/,''), "match one or more whitespace characters followed by a non-digit": the former also removes the newline at the end of each line, because nothing follows it, whereas the latter leaves that newline in place, because a positive lookahead needs an actual character to match.
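
The difference between the two lookaheads can be checked on one line. This is a small sketch; the variable names negative and positive are mine.

```ruby
s = "INVOICE # 2599\n"

# Negative lookahead: whitespace NOT followed by a digit.
# The trailing "\n" is followed by nothing, so it qualifies and is removed.
negative = s.gsub(/\s+(?!\d)/, '')

# Positive lookahead: whitespace followed by a non-digit character.
# The trailing "\n" has no character after it, so it is kept.
positive = s.gsub(/\s+(?=\D)/, '')

p negative  # "INVOICE# 2599"
p positive  # "INVOICE# 2599\n"

# String#split ignores trailing whitespace, so both give the same pair here.
p negative.split == positive.split  # true
```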

The steps are as follows:

enum1 = text.each_line
  #=> #<Enumerator: "INVOICE # 2599\nINVOICE 0185570\nINVOICE: 1739\n
  #     INVOICE- 45441\nINVOICE:# 1234\nINVOICE :# 5678\n":each_line>

I've used String#each_line rather than String#lines (or other ways to create an array of lines) to avoid the creation of a temporary array.
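
To make the distinction concrete, a short sketch on a shortened two-line sample (the names lazy_lines and eager_lines are only for illustration):

```ruby
text = "INVOICE # 2599\nINVOICE 0185570\n"

lazy_lines  = text.each_line  # Enumerator; no array of lines is built at this point
eager_lines = text.lines      # Array holding every line of the string

p lazy_lines.class   # Enumerator
p eager_lines.class  # Array

# Both yield the same lines, newline included.
p lazy_lines.to_a == eager_lines  # true
```

Note that map on the enumerator still returns a new array of results; only the intermediate array of raw lines is avoided.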

enum2 = enum1.map
  #=> #<Enumerator: #<Enumerator: "INVOICE # 2599\nINVOICE 0185570\nINVOICE: 1739\n
  #     INVOICE- 45441\nINVOICE:# 1234\nINVOICE :# 5678\n":each_line>:map> 

s = enum2.next
  #=> "INVOICE # 2599\n" 
t = s.gsub(/\s+(?!\d)/,'')
  #=> "INVOICE# 2599" 
t.split 
  #=> ["INVOICE#", "2599"] 

s = enum2.next
  #=> "INVOICE 0185570\n" 
t = s.gsub(/\s+(?!\d)/,'')
  #=> "INVOICE 0185570" 
t.split 
  #=> ["INVOICE", "0185570"] 

and so on.

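Another way of doing this is to remove spaces and tabs not followed by a digit before the string is divided into lines. The newlines must be left alone here, because each_line splits on them and \s would strip every newline that precedes the "I" of the next line, so the character class is [ \t] rather than \s:

text.gsub(/[ \t]+(?!\d)/, '').each_line.map(&:split)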
 #=> [["INVOICE#", "2599"], ["INVOICE", "0185570"], ["INVOICE:", "1739"],
 #    ["INVOICE-", "45441"], ["INVOICE:#", "1234"], ["INVOICE:#", "5678"]]

Upvotes: 0

3limin4t0r

Reputation: 21130

You can use \D to match any non-digit character. Capture both the word and the non-digits that follow it in the first group and the digits in the second group, then remove the spaces from the first capture. Here is an example of how it might look:

text.scan(/(\w+\D+)(\d+)/).each { |group_1,| group_1.delete!(' ') }
#=> [["INVOICE#", "2599"], ["INVOICE", "0185570"], ["INVOICE:", "1739"], ["INVOICE-", "45441"], ["INVOICE:#", "1234"], ["INVOICE:#", "5678"]]

You could also use gsub! or tr! instead of delete!. Replacing \D with \W (non-word character) would also work.
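
A sketch of those variants on a shortened two-line sample; the variable names are mine, and the block parameter is called prefix here instead of group_1:

```ruby
text = "INVOICE # 2599\nINVOICE :# 5678\n"

# Same pattern, spaces removed with tr! (an empty replacement deletes).
with_tr = text.scan(/(\w+\D+)(\d+)/).each { |prefix, _| prefix.tr!(' ', '') }

# Same pattern, spaces removed with gsub!.
with_gsub = text.scan(/(\w+\D+)(\d+)/).each { |prefix, _| prefix.gsub!(' ', '') }

# \W (non-word character) in place of \D, spaces removed with delete!.
with_non_word = text.scan(/(\w+\W+)(\d+)/).each { |prefix, _| prefix.delete!(' ') }

p with_tr        # [["INVOICE#", "2599"], ["INVOICE:#", "5678"]]
p with_gsub      # [["INVOICE#", "2599"], ["INVOICE:#", "5678"]]
p with_non_word  # [["INVOICE#", "2599"], ["INVOICE:#", "5678"]]
```

The bang methods return nil when nothing changes, which does not matter here because each returns the array of captures and the block's return value is ignored.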

Keep in mind that \w equals [A-Za-z0-9_] and can also match digits and underscores.
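
A quick way to see this, using the hypothetical string "INV_01" (not from the question), and one possible letters-only variant of the pattern in case digits or underscores must not count as part of the word:

```ruby
sample = 'INV_01'  # hypothetical input, for illustration only

word_chars   = sample.scan(/\w/).join        # letters, digits and underscore all match
letters_only = sample.scan(/[A-Za-z]/).join  # only ASCII letters match

p word_chars    # "INV_01"
p letters_only  # "INV"

# Letters-only variant of the pattern, assuming the word is plain ASCII letters.
pairs = 'INVOICE :# 5678'.scan(/([A-Za-z]+\D+)(\d+)/)
p pairs  # [["INVOICE :# ", "5678"]]
```

The first capture still contains its spaces at this point; they would be removed afterwards as in the example above.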

Upvotes: 1
