Frederic Faure

Reputation: 13

Question about a multiline regex in Python

I want to select groups of lines in a text file to get all jobs related to a given IP ref. The test file looks like this: job numbers: (1, 2, 3), IP refs: (10, 12, 10).

Text file :

1
... (several lines of text)
xxx 10
2
... (several lines of text)
xxx 12
3
... (several lines of text)
xxx 10

I want to select the job numbers for IP ref = 10.

Code :

#!/usr/bin/python

import re
import sys

fic=open('test2.xml','r')
texte=fic.read()
fic.close()


#pattern='\n?\d(?!(?:\n?xxx \d{2}\n)*)xxx 10'
pattern='\n?\d.*?xxx 10'

result= re.findall(pattern,texte, re.DOTALL)

i=1
for match in result:
    print("\nmatch:",i)
    i=i+1
    print(match)

Result :

match: 1
1
a
b
xxx 10

match: 2

1
a
b
xxx 12
1
a
b
xxx 10

I tried replacing .* with a negative lookahead assertion, so that a block is selected only if no expression like "\n?xxx \d{2}\n" appears before "xxx 10":

pattern='\n?\d(?!(?:\n?xxx \d{2}\n)*)xxx 10'

but it does not work.
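A possible reason the lookahead attempt returns nothing, shown as a small sketch: the group inside the lookahead is repeated with *, so it can match zero times, which means the lookahead body always succeeds and the negated lookahead always fails. The sample text below is an assumption modelled on the output shown above.

```python
import re

# Sample text assumed from the output above: three blocks ending in "xxx <ref>".
text = "1\na\nb\nxxx 10\n1\na\nb\nxxx 12\n1\na\nb\nxxx 10\n"

# (?:b)* can match the empty string, so (?!(?:b)*) can never succeed.
always_fails = re.search(r'a(?!(?:b)*)', 'ac')
print(always_fails)  # None

# The same construct in the attempted pattern makes the whole pattern unmatchable.
pattern = r'\n?\d(?!(?:\n?xxx \d{2}\n)*)xxx 10'
matches = re.findall(pattern, text, re.DOTALL)
print(matches)  # []
```

Even without the lookahead, this pattern requires "xxx 10" to follow the digit immediately, so the intermediate lines would still have to be consumed by something.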

Upvotes: 0

Views: 72

Answers (4)

Frederic Faure

Reputation: 13

file :

job_number job_id
1 10202
bla bla
bla bla bla
xxx 100.10.10.100
2 10203
bla bla
bla bla bla
bla bla bla
xxx 100.10.10.102
3 10204
bla bla bla
bla bla bla
xxx 100.10.10.100

Bash script with an embedded Python script :

#!/bin/bash

# function , $1 : ip of a printer
get_jobs_ip ()
{
cat <<EOF | python
import re

fic=open('test3.xml','r')
texte=fic.read()
fic.close()

"""
Example of the pattern with ip="100\.10\.10\.100"
(thanks to The fourth bird for the pattern!):
#pattern='^\d\s+\d+(?:\n(?!xxx \d+\.\d+\.\d+\.\d+$).*)*\nxxx 100\.10\.10\.100$'

^ Start of a line (re.MULTILINE is set)
\d\s+\d+ Match a single digit (or \d+ for 1 or more), whitespace, then the job id
(?: Non-capturing group
\n Match a newline
(?!xxx \d+\.\d+\.\d+\.\d+$) Negative lookahead to assert that the line is not xxx followed by an IP address
.* If the assertion is true, match the whole line
)* Close the group and optionally repeat it
\nxxx 100\.10\.10\.100$ Match a newline, xxx and the IP address at the end of a line
"""

ip="$1"
pattern_template='^\d\s+\d+(?:\n(?!xxx \d+\.\d+\.\d+\.\d+$).*)*\nxxx @ip@$'
pattern=pattern_template.replace('@ip@',ip)

result= re.findall(pattern,texte, re.MULTILINE)

i=1
for match in result:
    print("\nmatch:",i)
    i=i+1
    print(match)
EOF
}

get_jobs_ip "100\.10\.10\.100"
get_jobs_ip "100\.10\.10\.102"

result :

match: 1
1 10202
bla bla
bla bla bla
xxx 100.10.10.100

match: 2
3 10204
bla bla bla
bla bla bla
xxx 100.10.10.100

match: 1
2 10203
bla bla
bla bla bla
bla bla bla
xxx 100.10.10.102
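As a variation, the IP could be passed unescaped and escaped inside Python with re.escape, which avoids writing "100\.10\.10\.100" by hand. This is only a sketch: the function name find_jobs_for_ip and the inline sample text are assumptions for illustration, and \d+ is used for the job number so that multi-digit job numbers also match.

```python
import re

def find_jobs_for_ip(text, ip):
    """Return every block that starts with a job line and ends with 'xxx <ip>'."""
    # re.escape makes each "." in the IP literal instead of "any character".
    pattern = (
        r'^\d+\s+\d+'
        r'(?:\n(?!xxx \d+\.\d+\.\d+\.\d+$).*)*'
        r'\nxxx ' + re.escape(ip) + r'$'
    )
    return re.findall(pattern, text, re.MULTILINE)

# Inline copy of the sample file shown above (assumed content of test3.xml).
sample = (
    "job_number job_id\n"
    "1 10202\nbla bla\nbla bla bla\nxxx 100.10.10.100\n"
    "2 10203\nbla bla\nbla bla bla\nbla bla bla\nxxx 100.10.10.102\n"
    "3 10204\nbla bla bla\nbla bla bla\nxxx 100.10.10.100\n"
)

for block in find_jobs_for_ip(sample, "100.10.10.100"):
    # Print only the job line of each matching block.
    print(block.splitlines()[0])
```

With the sample above this prints the job lines "1 10202" and "3 10204".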

Upvotes: 0

Frederic Faure

Reputation: 13

Thank you very much (you saved my day!). As you suggested:

pattern='^\d(?:\n(?!xxx \d+$).*)*\nxxx 10$'
result= re.findall(pattern,texte, re.MULTILINE)

Result: OK, the line group (1 ... xxx 12) is ignored. Note: I can adapt this to a case where line 1 gives the job information and the "xxx 12" line gives the printer IP information.

match: 1
1
a
b
xxx 10

match: 2
1
a
b
xxx 10

Upvotes: 0

Frederic Faure

Reputation: 13

Good day to you :) and thank you very much for your quick response! The result is below. Note: I replaced re.DOTALL with re.DOTALL|re.MULTILINE, because there is no match without it. Sorry for the earlier presentation, it was not very clear.

Text file :

1
a
b
xxx 10
1
a
b
xxx 12
1
a
b
xxx 10

Code With your pattern :

#!/usr/bin/python

import re
import sys

fic=open('test2.xml','r')
texte=fic.read()
fic.close()
print(texte)

#pattern='<\/?(?!(?:span|br|b)(?: [^>]*)?>)[^>\/]*>'
#pattern='\n?\d(?!(?:\n?xxx \d{2}\n?)*?)xxx 10'
#pattern='\n?\d.*?xxx 10'
pattern='^\d(?:\n(?!xxx \d+$).*)*\nxxx 10$'

result= re.findall(pattern,texte, re.DOTALL|re.MULTILINE)

i=1
for match in result:
    print("\nmatch:",i)
    i=i+1
    print(match)

Result :

match: 1
1
a
b
xxx 10
1
a
b
xxx 12
1
a
b
xxx 10 

but what I am trying to obtain is:

match: 1
1
a
b
xxx 10

match: 2
1
a
b
xxx 10
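A likely explanation for the single large match, as a sketch: with re.DOTALL the .* inside the group also consumes newlines, so it runs across block boundaries and backtracks only as far as the last "xxx 10". With re.MULTILINE alone, .* stops at the end of each line. The inline text below is an assumed copy of the test file above.

```python
import re

# Assumed copy of the test file shown above.
text = "1\na\nb\nxxx 10\n1\na\nb\nxxx 12\n1\na\nb\nxxx 10\n"
pattern = r'^\d(?:\n(?!xxx \d+$).*)*\nxxx 10$'

# With DOTALL, "." matches newlines, so one match swallows all three blocks.
with_dotall = re.findall(pattern, text, re.DOTALL | re.MULTILINE)

# With MULTILINE only, ".*" stays on one line and each block is tested separately.
multiline_only = re.findall(pattern, text, re.MULTILINE)

print(len(with_dotall))     # 1
print(len(multiline_only))  # 2
```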

Upvotes: 1

The fourth bird

Reputation: 163277

You can write the pattern in this way, repeating the newline and asserting not xxx followed by 1 or more digits:

^\d(?:\n(?!xxx \d+$).*)*\nxxx 10$

The pattern matches:

  • ^ Start of a line (with re.MULTILINE; without it, start of the string)
  • \d Match a single digit (or \d+ for 1 or more)
  • (?: Non-capturing group
    • \n Match a newline
    • (?!xxx \d+$) Negative lookahead to assert that the line is not xxx followed by 1+ digits
    • .* If the assertion is true, match the whole line
  • )* Close the group and optionally repeat it
  • \nxxx 10$ Match a newline, then xxx 10 at the end of a line
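A minimal sketch of using the pattern from Python, assuming the sample layout from the question (job numbers 1, 2, 3 with refs 10, 12, 10). The capture group around \d+ is an addition for illustration, so that re.findall returns only the job numbers rather than the whole block.

```python
import re

# Assumed sample text: three blocks, each ending in "xxx <ref>".
text = "1\na\nb\nxxx 10\n2\na\nb\nxxx 12\n3\na\nb\nxxx 10\n"

# Same structure as the pattern above, with the job number captured.
pattern = r'^(\d+)(?:\n(?!xxx \d+$).*)*\nxxx 10$'

# re.MULTILINE makes ^ and $ match at line boundaries; re.DOTALL is not used.
job_numbers = re.findall(pattern, text, re.MULTILINE)
print(job_numbers)  # ['1', '3']
```

Because the lookahead stops the repeated group at the first "xxx <digits>" line, the block that ends in "xxx 12" cannot extend into the next block and is skipped.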


Upvotes: 1
