string matching in python

Question

I am having trouble with the following problem. Let's say I have some strings in two lists in a dictionary:

 left                                right
british                             7
cuneate nucleus                     Medulla oblongata
Motoneurons                         anterior

And I have some test lines in a file, like the ones below:

British Meanwhile is the studio 7 album by british pop band 10cc 7.
Medulla oblongata,the name refers collectively to the cuneate nucleus and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.
Terior horn cells, motoneurons located in the spinal.

I want to get output like the following:

British Meanwhile is the studio 7 album by british pop band 10cc 7.
Medulla oblongata,the name refers collectively to the cuneate nucleus and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.

I tried the following code:

import re

def textReturn(left, right):
    text = ""
    filetext = open("text.xml", "r").read()
    linelist = re.split(u'[\n|\r]+',filetext)

    for i in linelist:
        left = left.strip()
        right = right.strip()

        if left in i and right in i:
            i1 = re.sub('(?i)(\s+)(%s)(\s+)'%left, '\1\2\3', i)
            i2 = re.sub('(?i)(\s+)(%s)(\s+)'%right, '\1\2\3', i1)
            text = text + i2 + "\n"
    return text

But it gives me:

'British meanwhile is the studio 7 album by British pop band 10cc 7.'.
Medulla oblongata,the name refers collectively to the cuneate nucleus and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.
Terior horn cells, motoneurons located in the spinal.

i.e. it can't tag a string that appears at the beginning or end of a line.

Also, I only want to return the lines that match both the left and right strings, not the other lines.

Any solution please! Thanks a lot!!!

Ray Toal · Accepted Answer

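It doesn't tag at the beginning and the end because your pattern requires one or more whitespace characters (\s+) both before and after each keyword. A keyword at the very start of a line has nothing before it, and one followed by a period has no whitespace after it.

Instead of \s+, use \b (word boundary). It matches at the edge of a word without consuming any characters.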
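
To see the difference, here is a small sketch that runs both patterns over the first test line from the question. The helper name find_all and the two template constants are made up for this illustration; re.escape is added only so a keyword containing regex metacharacters would still be matched literally.

```python
import re

line = "British Meanwhile is the studio 7 album by british pop band 10cc 7."

# Pattern templates; %s is replaced by the (escaped) keyword.
SPACE_DELIMITED = r"\s+(%s)\s+"   # needs whitespace on both sides
WORD_BOUNDED = r"\b(%s)\b"        # only needs a word edge on both sides


def find_all(template, keyword, text):
    """Return every case-insensitive match of keyword under the given template."""
    pattern = template % re.escape(keyword)
    return [m.group(1) for m in re.finditer(pattern, text, re.IGNORECASE)]


print(find_all(SPACE_DELIMITED, "british", line))  # ['british']
print(find_all(WORD_BOUNDED, "british", line))     # ['British', 'british']
print(find_all(SPACE_DELIMITED, "7", line))        # ['7']
print(find_all(WORD_BOUNDED, "7", line))           # ['7', '7']
```

The \s+ version misses "British" at the start of the line and the final "7" before the period, which is exactly the symptom described in the question.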

ADDENDUM

Actual code:

import re

# (left, right) keyword pairs, one pair per input line
pairs = [('british','7'),('cuneate nucleus','Medulla oblongata'),('Motoneurons','anterior')]

filetext = """British Meanwhile is the studio 7 album by british pop band 10cc 7.
Medulla oblongata,the name refers collectively to the cuneate nucleus and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.
Terior horn cells, motoneurons located in the spinal.
"""

linelist = re.split(u'[\n|\r]+', filetext)

# Opening tag, content, and closing tag of a line that is wrapped in markup
s_tag = re.compile(r"(<[^>]+>)(.*?)(</[^>]+>)")

for (left, right), line in zip(pairs, linelist):
    line_parts = s_tag.search(line)
    if line_parts:
        start, content, end = line_parts.groups()
    else:
        # No wrapping tag on this line: treat the whole line as content
        start, content, end = "", line, ""

    # Raw strings, so that \b is a regex word boundary and not a backspace
    left_match = r"(?i)\b(%s)\b" % left
    right_match = r"(?i)\b(%s)\b" % right
    if re.search(left_match, content) and re.search(right_match, content):
        # r'\1' re-emits the matched text; put the desired markup around it
        line1 = re.sub(left_match, r'\1', content)
        line2 = re.sub(right_match, r'\1', line1)
        print(start + line2 + end)

This is the basis for a short-term solution, but long-term you should try out the XML parser approach.
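
As a rough idea of what the XML parser approach might look like, here is a sketch using the standard library's xml.etree.ElementTree. The real structure of text.xml isn't shown in the question, so the layout below (a doc root holding s elements with id attributes) is purely an assumption, and lines_matching_both is a hypothetical helper name. It returns only the lines that contain both keywords as whole words.

```python
import re
import xml.etree.ElementTree as ET

# Assumed layout for illustration only; adapt the element names to the real file.
SAMPLE = """<doc>
<s id="1">British Meanwhile is the studio 7 album by british pop band 10cc 7.</s>
<s id="2">Terior horn cells, motoneurons located in the spinal.</s>
</doc>"""


def lines_matching_both(xml_text, left, right, element="s"):
    """Return the text of each element that contains both keywords as whole words."""
    root = ET.fromstring(xml_text)
    left_re = re.compile(r"\b%s\b" % re.escape(left.strip()), re.IGNORECASE)
    right_re = re.compile(r"\b%s\b" % re.escape(right.strip()), re.IGNORECASE)
    hits = []
    for node in root.iter(element):
        text = node.text or ""
        if left_re.search(text) and right_re.search(text):
            hits.append(text)
    return hits


print(lines_matching_both(SAMPLE, "british", "7"))
print(lines_matching_both(SAMPLE, "Motoneurons", "anterior"))
```

With a parser, the line content comes out of node.text directly, so there is no need for a regex that picks apart the opening and closing tags.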
