Reputation: 123
I am quite new to Python. I want to match strings in some lines of a file. Let's say I have these strings:
british 7
German 8
France 90
And I have some lines in a file like:
<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc.</s>
<s id="15-8">...And Then There Were Three... is the ninth studio album by the german band Genesis 8 and was released in 1978.</s>
<s id="1990-2">Magnum Nitro Express is a France centerfire fire rifle cartridge 90.</s>
I want to get output like:
<s id="69-7">...Meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc.</s>
<s id="15-8">...And Then There Were Three... is the ninth studio album by the <w1>german</w1> band Genesis <w2>8</w2> and was released in 1978.</s>
<s id="1990-2">Magnum Nitro Express is a <w1>France</w1> centerfire fire rifle cartridge <w2>90</w2>.</s>
I tried with the following code:
for i in file:
    if left in i and right in i:
        line = i.replace(left, '<w1>' + left + '</w1>')
        lineR = line.replace(right, '<w2>' + right + '</w2>')
        text = text + lineR + "\n"
        continue
return text
But it also matches the string inside the id, e.g.:
<s id="69-<w2>7</w2>">...Meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc.</s>
So, is there any way to search for the strings as whole words rather than as characters, so that I can avoid <s id="69-<w2>7</w2>">?
Thanks in advance for any kind of help.
Upvotes: 3
Views: 465
Reputation: 1127
You should use regular expressions to specifically replace only individual words, not word parts.
Something like
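import re
left = 'british'
right = '7'
# i is one line of the file, as in the question's loop
i = '<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc.</s>'
# (?i) makes the match case-insensitive; the word must have whitespace on both sides
i1 = re.sub(r'(?i)(\s+)(%s)(\s+)' % left, r'\1<w1>\2</w1>\3', i)
i2 = re.sub(r'(?i)(\s+)(%s)(\s+)' % right, r'\1<w2>\2</w2>\3', i1)
print(i2)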
which gives us '<s id="69-7">...Meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc.</s>'
If this approach leads to errors, you can try more refined code, such as:
import re

def do(left, right, line):
    # Split into tags and text; the capturing group keeps the tags in the list
    parts = [x for x in re.split(r'(<[^>]+>)', line) if x]
    for idx, l in enumerate(parts):
        lu = l.upper()
        # Skip parts that look like <s ...> or </s> tags; require both strings in the text
        if (not ('<s' in l or 's>' in l) and
                (left.upper() in lu and right.upper() in lu)):
            l = re.sub(r'(?i)(\s+)(%s)(\s+)' % left, r'\1<w1>\2</w1>\3', l)
            l = re.sub(r'(?i)(\s+)(%s)(\s+)' % right, r'\1<w2>\2</w2>\3', l)
            parts[idx] = l
    return ''.join(parts)

line = '<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc.</s>'
print(do('british', '7', line))
# '-7' occurs only inside the tag, so the line is returned unchanged
print(do('british', '-7', line))
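One limitation of requiring whitespace on both sides is that a word followed by punctuation, such as the "90." in the third sample line, is not matched. A minimal sketch of a variant follows, assuming that word boundaries (\b) are an acceptable definition of "whole word" and that every tag has the form <...>. The names TAG_RE and tag_words are hypothetical, not part of the code above.

```python
import re

TAG_RE = re.compile(r'(<[^>]+>)')  # capturing group keeps the tags when splitting

def tag_words(line, left, right):
    """Wrap whole-word matches of left/right in <w1>/<w2>, leaving markup untouched."""
    # One pattern for both words, so text already wrapped for `left`
    # is never rescanned when looking for `right`.
    word_re = re.compile(
        r'\b(?:(%s)|(%s))\b' % (re.escape(left), re.escape(right)),
        re.IGNORECASE,
    )

    def wrap(match):
        tag = 'w1' if match.group(1) is not None else 'w2'
        return '<%s>%s</%s>' % (tag, match.group(0), tag)

    parts = TAG_RE.split(line)
    # With a capturing split, odd indices are the tags and even indices are plain text.
    return ''.join(
        part if idx % 2 else word_re.sub(wrap, part)
        for idx, part in enumerate(parts)
    )

line = '<s id="1990-2">Magnum Nitro Express is a France centerfire fire rifle cartridge 90.</s>'
print(tag_words(line, 'france', '90'))
```

Because the id attribute lives inside a tag, it is never searched, and \b keeps the '8' in '1978' from matching. Note that \b only behaves as expected when the search strings begin and end with word characters; something like '-7' would need different handling.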
Upvotes: 3
Reputation: 27575
I have something rather complicated, but I wrote it in a hurry and for the moment it does the job.
Note that:
I added 'in France' after "Meanwhile is the studio 7 album by British pop band 10cc", and only British is modified.
'1978' in "by the german band Genesis 8 and was released in 1978" isn't modified, while the '8' is.
That is why the code is complicated.
But I fear that, despite this complication, it won't be exact for all possible sentences.
It should be improved so that idi is always the correct keyword for the sentence, and not always the first one found, as in the present solution. But that is hard to do without knowing exactly what you want.
# Python 2 syntax (print statement, generator .next())
ss ='''british 7
German 8
France 90'''
text = '''<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc in France.</s>
<s id="15-8">...And Then There Were Three... is the ninth studio album by the german band Genesis 8 and was released in 1978.</s>
<s id="1990-2">Magnum Nitro Express is a France centerfire fire rifle cartridge 90.</s>
'''
import re
regx = re.compile('^(.+?)[ \t]+(\d+)',re.MULTILINE)
dico = dict((a.lower(),b) for (a,b) in regx.findall(ss))
print 'dico==',dico
print '\n\n'
rogx = re.compile('(<s id="[\d-]+">|</s>\r?\n)')
splitted = rogx.split(text)
print 'splitted==\n',splitted
print '=================\n'
def repl(mat):
    idi = (b for (a,b) in the if b).next().lower()
    x,y = mat.groups()
    if x:
        if dico[idi.lower()]==x:
            return '<w2>%s</w2>' % x
        else:
            return x
    if y :
        if y.lower()==idi:
            return '<w1>%s</w1>' % y
        else:
            return y
rigx = re.compile('(\d+)|(' + '|'.join(dico.keys()) + ')',re.IGNORECASE)
for i,el in enumerate(splitted[0::2]):
    if el:
        print '-----------------------------'
        print '* index in splitted==',2*i
        print '\n* el==\n',repr(el)
        print '\n* rigx.findall(el)==\n',rigx.findall(el)
        the = rigx.findall(el)
        print '\n* modified el:\n',rigx.sub(repl,el)
        splitted[2*i] = rigx.sub(repl,el)
print '\n\n##################################\n\n'
print 'modified splitted==\n',splitted
print
print ''.join(splitted)
Result:
dico== {'german': '8', 'british': '7', 'france': '90'}
splitted==
['', '<s id="69-7">', '...Meanwhile is the studio 7 album by British pop band 10cc in France.', '</s>\n', '', '<s id="15-8">', '...And Then There Were Three... is the ninth studio album by the german band Genesis 8 and was released in 1978.', '</s>\n', '', '<s id="1990-2">', 'Magnum Nitro Express is a France centerfire fire rifle cartridge 90.', '</s>\n', '']
=================
-----------------------------
* index in splitted== 2
* el==
'...Meanwhile is the studio 7 album by British pop band 10cc in France.'
* rigx.findall(el)==
[('7', ''), ('', 'British'), ('10', ''), ('', 'France')]
* modified el:
...Meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc in France.
-----------------------------
* index in splitted== 6
* el==
'...And Then There Were Three... is the ninth studio album by the german band Genesis 8 and was released in 1978.'
* rigx.findall(el)==
[('', 'german'), ('8', ''), ('1978', '')]
* modified el:
...And Then There Were Three... is the ninth studio album by the <w1>german</w1> band Genesis <w2>8</w2> and was released in 1978.
-----------------------------
* index in splitted== 10
* el==
'Magnum Nitro Express is a France centerfire fire rifle cartridge 90.'
* rigx.findall(el)==
[('', 'France'), ('90', '')]
* modified el:
Magnum Nitro Express is a <w1>France</w1> centerfire fire rifle cartridge <w2>90</w2>.
##################################
modified splitted==
['', '<s id="69-7">', '...Meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc in France.', '</s>\n', '', '<s id="15-8">', '...And Then There Were Three... is the ninth studio album by the <w1>german</w1> band Genesis <w2>8</w2> and was released in 1978.', '</s>\n', '', '<s id="1990-2">', 'Magnum Nitro Express is a <w1>France</w1> centerfire fire rifle cartridge <w2>90</w2>.', '</s>\n', '']
<s id="69-7">...Meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc in France.</s>
<s id="15-8">...And Then There Were Three... is the ninth studio album by the <w1>german</w1> band Genesis <w2>8</w2> and was released in 1978.</s>
<s id="1990-2">Magnum Nitro Express is a <w1>France</w1> centerfire fire rifle cartridge <w2>90</w2>.</s>
Edit: I eliminated replmodel().
repl() now uses the value of rigx.findall(el).
I added the line the = rigx.findall(el) for that.
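For readers on Python 3, here is a compact sketch of the same idea: build a lookup from the word/number list, take the first listed word found in each sentence, and tag only that word and its paired number. It is an approximation under stated assumptions, not a line-for-line port. The names load_pairs, tag_sentence and tag_text are hypothetical, tags are assumed to have the form <...>, and numbers are matched with \b instead of a bare \d+.

```python
import re

def load_pairs(spec):
    """Parse 'word number' lines into a dict keyed by the lowercased word."""
    return {w.lower(): n
            for w, n in re.findall(r'^(.+?)[ \t]+(\d+)', spec, re.MULTILINE)}

def tag_sentence(body, pairs):
    """Tag the first listed word found in body, plus the number paired with it."""
    token_re = re.compile(
        r'\b(\d+)\b|\b(' + '|'.join(map(re.escape, pairs)) + r')\b',
        re.IGNORECASE,
    )
    # First listed word in the sentence (same simplification as the answer above).
    first = next((m.group(2).lower()
                  for m in token_re.finditer(body) if m.group(2)), None)
    if first is None:
        return body

    def repl(m):
        num, word = m.groups()
        if num is not None:
            return '<w2>%s</w2>' % num if num == pairs[first] else num
        return '<w1>%s</w1>' % word if word.lower() == first else word

    return token_re.sub(repl, body)

def tag_text(text, pairs):
    """Apply tag_sentence to the text between tags, leaving the tags untouched."""
    parts = re.split(r'(<[^>]+>)', text)  # odd indices are the tags
    return ''.join(p if i % 2 else tag_sentence(p, pairs)
                   for i, p in enumerate(parts))

pairs = load_pairs('british 7\nGerman 8\nFrance 90')
print(tag_text('<s id="69-7">...Meanwhile is the studio 7 album by British '
               'pop band 10cc in France.</s>', pairs))
```

As in the original, 'France' in the first sentence is left alone because 'British' is found first, and the choice of "first word wins" remains a guess at what is actually wanted.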
Upvotes: 4
Reputation: 29103
The best way is to use a regex. But if 'left' and 'right' will always have at least one leading and one trailing space, you can use a simple trick (just add a leading and a trailing space to your pattern):
line = i.replace(' ' + left + ' ', ' <w1>' + left + '</w1> ')
lineR = line.replace(' ' + right + ' ', ' <w2>' + right + '</w2> ')
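To see where this trick applies and where it does not, here is a small sketch wrapping it in a function. The name tag_padded is hypothetical, and the sample lines are shortened versions of the ones in the question.

```python
def tag_padded(line, left, right):
    """Tag left/right only where they appear with a space on both sides."""
    line = line.replace(' ' + left + ' ', ' <w1>' + left + '</w1> ')
    return line.replace(' ' + right + ' ', ' <w2>' + right + '</w2> ')

# The id attribute is safe: '8' there is surrounded by '-' and '"', not spaces.
print(tag_padded('<s id="15-8">by the german band Genesis 8 and was released in 1978.</s>',
                 'german', '8'))

# '90' is followed by a period rather than a space, so it is left untagged.
print(tag_padded('<s id="1990-2">a France centerfire cartridge 90.</s>',
                 'France', '90'))
```

str.replace is also case-sensitive, so searching for 'british' would not tag 'British'; that case needs the regex approach.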
Upvotes: 1