FlashBanistan

Reputation: 456

How would I match a string that may or may not span multiple lines?

I have a document that, when converted to text, splits a phone number across multiple lines like this:

(xxx)-xxx-
xxxx

For a variety of reasons related to my project I can't simply join the lines.

If I know that phonenumber = "(555)-555-5555", how can I compile a regex so that if I run it over

(555)-555-
5555

it will match?

EDIT:

To help clarify my question here it is in a more abstract form.

test_string = "xxxx xx x xxxx"
text = """xxxx xx
x
xxxx"""

I need the test string to be found in the text. Newlines can appear anywhere in the text, and characters that need to be escaped in a regex should be handled as well.

Upvotes: 3

Views: 160

Answers (3)

Srdjan M.

Reputation: 3405

import re

data = ["(555)-555-\n5555", "(55\n5)-555-\n55\n55", "(555\n)-555-\n5555", "(555)-555-5555"]

number = '(555)-555-5555'
# allow optional newlines between every pair of characters
pattern = re.sub(r'(?!^|$)', r'\\n*', number)
# escape the parentheses so they match literally
pattern = re.sub(r'(?=[()])', r'\\', pattern)

r = re.compile(pattern)

# filter() is lazy in Python 3, so materialize the result
match = list(filter(r.match, data))
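
Note that match() only succeeds at the start of a string. If the number sits somewhere inside a larger document, search() is probably what you want. A minimal sketch using the same construction; the document string here is made up for illustration:

```python
import re

number = "(555)-555-5555"
# Same construction as above: optional newlines between characters,
# then a backslash in front of each parenthesis.
pattern_text = re.sub(r"(?!^|$)", r"\\n*", number)
pattern_text = re.sub(r"(?=[()])", r"\\", pattern_text)
pattern = re.compile(pattern_text)

# Hypothetical document text with the number split across two lines.
document = "Call (555)-555-\n5555 after 9am."

anchored = pattern.match(document)   # None: the text does not start with the number
found = pattern.search(document)     # scans the whole string

print(anchored)
print(found.span(), repr(found.group()))
```

found.span() gives the offsets in the original, unjoined text, which may help if you cannot join the lines.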


Upvotes: 0

r.ook

Reputation: 13878

A simple workaround would be to remove all the \n characters from the document text before you search it:

pat = re.compile(r'\(\d{3}\)-\d{3}-\d{4}')
numbers = pat.findall(text.replace('\n',''))

# ['(555)-555-5555']
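
If the reason you can't join the lines is that you need positions in the original text, one option is to search a newline-free copy and map the offsets back. This is only a sketch under that assumption; find_ignoring_newlines is a hypothetical helper name, not something from the re module:

```python
import re

def find_ignoring_newlines(pattern, text):
    """Return (start, end) spans in the ORIGINAL text for matches found
    in a copy of the text with all newlines removed."""
    # For every character we keep, remember its index in the original text.
    kept = [(i, ch) for i, ch in enumerate(text) if ch != "\n"]
    stripped = "".join(ch for _, ch in kept)
    spans = []
    for m in re.finditer(pattern, stripped):
        if m.start() == m.end():
            continue  # skip empty matches, they have no last character to map
        start = kept[m.start()][0]
        end = kept[m.end() - 1][0] + 1  # one past the last matched character
        spans.append((start, end))
    return spans

text = "abc (555)-555-\n5555 def"
spans = find_ignoring_newlines(r"\(\d{3}\)-\d{3}-\d{4}", text)
for start, end in spans:
    print(start, end, repr(text[start:end]))
```

The original text is never modified; only the temporary copy is.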

If removing the newlines is not an option for any reason, the obvious answer, though unsightly, is to allow optional newline characters between every pair of characters in the pattern:

pat = re.compile(r'\(\n*5\n*5\n*5\n*\)\n*-\n*5\n*5\n*5\n*-\n*5\n*5\n*5\n*5')

If you need to handle an arbitrary format, you can build the padded pattern from the number itself:

phonenumber = '(555)-555-5555'
# re.escape() takes care of any character that is special in a regex
pat = re.compile(r'\n*'.join(re.escape(i) for i in phonenumber))

# pat 
# re.compile(r'\(\n*5\n*5\n*5\n*\)\n*\-\n*5\n*5\n*5\n*\-\n*5\n*5\n*5\n*5', re.UNICODE)

Test case:

import random
def rndinsert(s):
    i = random.randrange(len(s)-1)
    return s[:i] + '\n' + s[i:]

for i in range(10):
    print(pat.findall(rndinsert('abc (555)-555-5555 def')))

# ['(555)-555-5555']
# ['(555)-5\n55-5555']
# ['(555)-5\n55-5555']
# ['(555)-555-5555']
# ['(555\n)-555-5555']
# ['(5\n55)-555-5555']
# ['(555)\n-555-5555']
# ['(555)-\n555-5555']
# ['(\n555)-555-5555']
# ['(555)-555-555\n5']
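
For the abstract form in the question's edit, where a newline can take the place of a space, the same idea can be extended. A rough sketch, assuming that any run of whitespace in the test string may show up as any run of whitespace in the text; flexible_pattern is a hypothetical helper name:

```python
import re

def flexible_pattern(needle):
    """Compile a pattern for `needle` that tolerates line breaks."""
    parts = []
    for word in needle.split():
        # Inside a word, allow stray newlines between the escaped characters.
        parts.append(r"\n*".join(re.escape(ch) for ch in word))
    # Between words, accept any run of whitespace, newlines included.
    return re.compile(r"\s+".join(parts))

test_string = "xxxx xx x xxxx"
text = """xxxx xx
x
xxxx"""

print(flexible_pattern(test_string).search(text) is not None)
```

Because every character goes through re.escape(), characters such as ( or . in the test string are matched literally.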

Upvotes: 1

Ajax1234

Reputation: 71461

You can allow for a possible \n inside the number in the pattern itself:

import re
nums = ["(555)-555-\n5555", "(555)-555-5555"]
new_nums = [i for i in nums if re.search(r'\([\d\n]+\)[\n-][\d\n]+-[\d\n]+', i)]

Output:

['(555)-555-\n5555', '(555)-555-5555']
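
Keep in mind that this pattern accepts any digits, not one specific number. If you are looking for a known number, you could strip the newlines from each matched fragment only (not from the whole document) and compare. A small sketch; the sample text is invented:

```python
import re

known = "(555)-555-5555"
loose = re.compile(r"\([\d\n]+\)[\n-][\d\n]+-[\d\n]+")

# Hypothetical text containing two numbers, one of them split across lines.
text = "Office: (555)-555-\n5555, fax: (555)-123-4567"

# Normalize only the matched fragments; the document itself is left untouched.
found = [m.group().replace("\n", "") for m in loose.finditer(text)]

print(found)
print(known in found)
```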

Upvotes: 1
