Liza

Reputation: 123

String searching that returns the matched line in Python

I am quite new to Python. I want to match strings in some lines of a file. Let's say I have these strings:

british    7
German     8
France     90

And I have some lines in a file like these:

<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc.</s>
<s id="15-8">...And Then There Were Three... is the ninth studio album by the german band Genesis 8 and was released in 1978.</s>
<s id="1990-2">Magnum Nitro Express is a France centerfire fire rifle cartridge 90.</s>

I want to get output like this:

<s id="69-7">...Meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc.</s>
<s id="15-8">...And Then There Were Three... is the ninth studio album by the <w1>german</w1> band Genesis <w2>8</w2> and was released in 1978.</s>
<s id="1990-2">Magnum Nitro Express is a <w1>France</w1> centerfire fire rifle cartridge <w2>90</w2>.</s>

I tried with the following code:

for i in file:
    if left in i and right in i:
        line = i.replace(left, '<w1>' + left + '</w1>')
        lineR = line.replace(right, '<w2>' + right + '</w2>')
        text = text + lineR + "\n"
return text

But it also matches the string inside the id attribute, e.g.:

<s id="69-<w2>7</w2>">...Meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc.</s>

So, is there any way to search for the strings as whole words rather than as characters, so that I can avoid <s id="69-<w2>7</w2>"> ?

Thanks in advance for any kind of help.

Upvotes: 3

Views: 465

Answers (3)

Alex Laskin

Reputation: 1127

You should use regular expressions to specifically replace only individual words, not word parts.

Something like

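import re
left = 'british'
right = '7'
# i is one line of the file
i = '<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc.</s>'
# (?i) ignores case; the \s+ groups require whitespace on both sides of the word
i1 = re.sub(r'(?i)(\s+)(%s)(\s+)' % re.escape(left), r'\1<w1>\2</w1>\3', i)
i2 = re.sub(r'(?i)(\s+)(%s)(\s+)' % re.escape(right), r'\1<w2>\2</w2>\3', i1)
print(i2)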

which gives us '<s id="69-7">...Meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc.</s>'

And if that approach still leads to errors, you can try more refined code, like

import re

def do(left, right, line):
    parts = [x for x in re.split('(<[^>]+>)', line) if x]
    for idx, l in enumerate(parts):
        lu = l.upper()
        if (not ('<s' in l or 's>' in l) and
            (left.upper() in lu and right.upper() in lu)):
            l = re.sub(r'(?i)(\s+)(%s)(\s+)' % re.escape(left), r'\1<w1>\2</w1>\3', l)
            l = re.sub(r'(?i)(\s+)(%s)(\s+)' % re.escape(right), r'\1<w2>\2</w2>\3', l)
            parts[idx] = l

    return ''.join(parts)


line = '<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc.</s>'
print(do('british', '7', line))
print(do('british', '-7', line))
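
One limitation of requiring whitespace on both sides is that a number followed by punctuation, such as the 90 in "cartridge 90.", is not matched. A possible variant, only a sketch, keeps the idea of splitting the line on tags but uses the \b word boundary instead of \s+. The name tag_words is made up for this illustration, and it assumes left and right are plain words or numbers:

```python
import re

def tag_words(left, right, line):
    # Split into tags and text; the capturing group keeps the tags in the list.
    parts = re.split(r'(<[^>]+>)', line)
    pat_left = re.compile(r'\b%s\b' % re.escape(left), re.IGNORECASE)
    pat_right = re.compile(r'\b%s\b' % re.escape(right), re.IGNORECASE)
    for idx, part in enumerate(parts):
        if part.startswith('<'):
            continue  # leave tags such as <s id="69-7"> untouched
        # Only rewrite a text part that contains both words.
        if pat_left.search(part) and pat_right.search(part):
            part = pat_left.sub(lambda m: '<w1>%s</w1>' % m.group(0), part)
            part = pat_right.sub(lambda m: '<w2>%s</w2>' % m.group(0), part)
            parts[idx] = part
    return ''.join(parts)

line = '<s id="1990-2">Magnum Nitro Express is a France centerfire fire rifle cartridge 90.</s>'
print(tag_words('France', '90', line))
```

Because the tags are skipped before any matching happens, the 7 in id="69-7" is never considered, even though \b7\b on its own would match after the hyphen.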

Upvotes: 3

eyquem

Reputation: 27575

I have something rather complicated, but I wrote it in a hurry, and for the moment it does the job.

Note that:

  • I added 'in France' after 'Meanwhile is the studio 7 album by British pop band 10cc',
    and only 'British' is modified

  • '1978' in 'by the german band Genesis 8 and was released in 1978' isn't modified, while the '8' is.

That is why it is complicated.

But I fear that, despite this complication, it won't be exact for every possible sentence.

It should be improved so that idi is always the correct musical group's name, and not simply the first one found, as it is in the present solution. But that is hard work without knowing exactly what you want.

ss ='''british    7
German     8
France     90'''


text = '''<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc in France.</s>
<s id="15-8">...And Then There Were Three... is the ninth studio album by the german band Genesis 8 and was released in 1978.</s>
<s id="1990-2">Magnum Nitro Express is a France centerfire fire rifle cartridge 90.</s>
'''




import re

# Each row of ss is "word <spaces> number"
regx = re.compile(r'^(.+?)[ \t]+(\d+)', re.MULTILINE)

# Map each lower-cased word to its number
dico = dict((a.lower(), b) for (a, b) in regx.findall(ss))
print('dico==', dico)
print('\n\n')

# Split into tags and sentence bodies; the bodies sit at the even indexes
rogx = re.compile(r'(<s id="[\d-]+">|</s>\r?\n)')
splitted = rogx.split(text)
print('splitted==\n', splitted, sep='')

print('=================\n')

def repl(mat):
    # idi is the first listed word found in the current sentence
    idi = next(b for (a, b) in the if b).lower()
    x, y = mat.groups()
    if x:
        if dico[idi] == x:
            return '<w2>%s</w2>' % x
        else:
            return x
    if y:
        if y.lower() == idi:
            return '<w1>%s</w1>' % y
        else:
            return y

rigx = re.compile(r'(\d+)|(' + '|'.join(dico.keys()) + ')', re.IGNORECASE)

for i, el in enumerate(splitted[0::2]):
    if el:
        print('-----------------------------')
        print('* index in splitted==', 2*i)
        print('\n* el==\n', repr(el), sep='')
        print('\n* rigx.findall(el)==\n', rigx.findall(el), sep='')
        the = rigx.findall(el)
        print('\n* modified el:\n', rigx.sub(repl, el), sep='')
        splitted[2*i] = rigx.sub(repl, el)


print('\n\n##################################\n\n')

print('modified splitted==\n', splitted, sep='')
print()
print(''.join(splitted))

result

dico== {'british': '7', 'german': '8', 'france': '90'}



splitted==
['', '<s id="69-7">', '...Meanwhile is the studio 7 album by British pop band 10cc in France.', '</s>\n', '', '<s id="15-8">', '...And Then There Were Three... is the ninth studio album by the german band Genesis 8 and was released in 1978.', '</s>\n', '', '<s id="1990-2">', 'Magnum Nitro Express is a France centerfire fire rifle cartridge 90.', '</s>\n', '']
=================

-----------------------------
* index in splitted== 2

* el==
'...Meanwhile is the studio 7 album by British pop band 10cc in France.'

* rigx.findall(el)==
[('7', ''), ('', 'British'), ('10', ''), ('', 'France')]

* modified el:
...Meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc in France.
-----------------------------
* index in splitted== 6

* el==
'...And Then There Were Three... is the ninth studio album by the german band Genesis 8 and was released in 1978.'

* rigx.findall(el)==
[('', 'german'), ('8', ''), ('1978', '')]

* modified el:
...And Then There Were Three... is the ninth studio album by the <w1>german</w1> band Genesis <w2>8</w2> and was released in 1978.
-----------------------------
* index in splitted== 10

* el==
'Magnum Nitro Express is a France centerfire fire rifle cartridge 90.'

* rigx.findall(el)==
[('', 'France'), ('90', '')]

* modified el:
Magnum Nitro Express is a <w1>France</w1> centerfire fire rifle cartridge <w2>90</w2>.


##################################


modified splitted==
['', '<s id="69-7">', '...Meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc in France.', '</s>\n', '', '<s id="15-8">', '...And Then There Were Three... is the ninth studio album by the <w1>german</w1> band Genesis <w2>8</w2> and was released in 1978.', '</s>\n', '', '<s id="1990-2">', 'Magnum Nitro Express is a <w1>France</w1> centerfire fire rifle cartridge <w2>90</w2>.', '</s>\n', '']

<s id="69-7">...Meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc in France.</s>
<s id="15-8">...And Then There Were Three... is the ninth studio album by the <w1>german</w1> band Genesis <w2>8</w2> and was released in 1978.</s>
<s id="1990-2">Magnum Nitro Express is a <w1>France</w1> centerfire fire rifle cartridge <w2>90</w2>.</s>

EDIT 1

I removed replmodel().

repl() now reads the result of rigx.findall(el).
I added the line the = rigx.findall(el) for that.
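
As a rough sketch of the improvement mentioned above, one option is to pick the first pair whose word and number both occur in the sentence, instead of the first listed word found. The helper names (parse_pairs, tag_sentence, tag_text) are invented for this illustration, and it assumes every row of the table has exactly two whitespace-separated columns and that each sentence sits on a single line:

```python
import re

def parse_pairs(table):
    # Turn rows like 'british    7' into [('british', '7'), ...]
    return [tuple(row.split()) for row in table.splitlines() if row.strip()]

def tag_sentence(body, pairs):
    # Tag the first pair whose word and number both occur as whole words.
    for word, number in pairs:
        w = r'\b%s\b' % re.escape(word)
        n = r'\b%s\b' % re.escape(number)
        if re.search(w, body, re.IGNORECASE) and re.search(n, body):
            # One pass over the sentence: group 1 is the word, group 2 the number.
            both = re.compile('(%s)|(%s)' % (w, n), re.IGNORECASE)
            return both.sub(
                lambda m: '<w1>%s</w1>' % m.group(1) if m.group(1)
                else '<w2>%s</w2>' % m.group(2),
                body)
    return body

def tag_text(text, pairs):
    # Rewrite only the text between the opening <s ...> tag and </s>.
    return re.sub(
        r'(<s id="[\d-]+">)(.*?)(</s>)',
        lambda m: m.group(1) + tag_sentence(m.group(2), pairs) + m.group(3),
        text)

ss = 'british    7\nGerman     8\nFrance     90'
text = '<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc in France.</s>\n'
print(tag_text(text, parse_pairs(ss)))
```

With this sample, 'France' stays untouched because the british/7 pair is found first. It still picks only one pair per sentence, so it may not be what you want when a sentence legitimately contains several pairs.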

Upvotes: 4

Artsiom Rudzenka

Reputation: 29103

The best way is to use a regex. But if 'left' and 'right' are always surrounded by at least one space on each side, you can use a simple trick (just add a leading and a trailing space to your pattern):

line = i.replace(' ' + left + ' ', ' <w1>' + left + '</w1> ')
lineR = line.replace(' ' + right + ' ', ' <w2>' + right + '</w2> ')
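
Note that str.replace is case-sensitive, so 'british' would not match 'British' this way. The same space-delimited idea can be written with regex lookarounds, which also makes it case-insensitive. This is only a sketch and the function name tag_space_delimited is made up for the example:

```python
import re

def tag_space_delimited(line, word, tag):
    # Match word only when whitespace (or the start/end of the line) surrounds it.
    pattern = re.compile(r'(?<!\S)%s(?!\S)' % re.escape(word), re.IGNORECASE)
    return pattern.sub(lambda m: '<%s>%s</%s>' % (tag, m.group(0), tag), line)

line = '<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc.</s>'
line = tag_space_delimited(line, 'british', 'w1')
line = tag_space_delimited(line, '7', 'w2')
print(line)
```

The 7 in id="69-7" is skipped because it is preceded by a hyphen rather than a space. As with the plain replace trick, a word directly followed by punctuation (the 90 in "cartridge 90.") is not matched.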

Upvotes: 1
