Reputation: 122052
I am trying to get the positions of the < and > characters that belong to a real tag, i.e. when they are part of something like <tag "510270">calculate</>.
I have sentences like these:
sentence1 = ("After six weeks and seventeen tentative approaches the only serious "
    "tender came from Daniel. He had offered a paltry #2 a week for the one-time "
    "woodman's home, sane enough in this, at least, to <tag \"510270\">calculate</> "
    "safety to the nearest new penny piece. ")
sentence2 = ("After six weeks and seventeen tentative approaches the only serious "
    "tender came from Daniel. He had offered a paltry #2 a week for the one-time "
    "woodman's < home, sane enough in this, at least, to <tag \"510270\">calculate</> "
    "safety to the nearest new penny > piece. ")
sentence3 = ("After six weeks and seventeen tentative approaches the only serious "
    "tender came from Daniel. He had offered a paltry #2 a week for the one-time "
    "woodman's > home, sane enough in this, at least, to <tag \"510270\">calculate</> "
    "safety to the nearest new penny < piece. ")
I need cfrom and incfrom to be the positions of the 1st and 2nd < within the <tag "XXXX">...</>, and I need cto and incto to be the positions of the 2nd and 1st > within the <tag "XXXX">...</>.
How could I also do this for sentences like sentence2 and sentence3, where < or > occurs outside of the <tag "XXXX">...</>?
For sentence1, I could simply do this:
cfrom, cto = 0, 0
# first "<" in the sentence
for i, c in enumerate(sentence1):
    if c == "<":
        cfrom = i
        break
# last ">" in the sentence, found by scanning backwards
for i, c in enumerate(reversed(sentence1)):
    if c == ">":
        cto = len(sentence1) - 1 - i
        break
incfrom, incto = 0, 0
# first ">" after cfrom ends the opening part of the tag
for i, c in enumerate(sentence1[cfrom:]):
    if c == ">":
        incto = cfrom + i
        break
# next "<" between incto and cto starts the closing part of the tag
for i, c in enumerate(sentence1[incto:cto]):
    if c == "<":
        incfrom = incto + i
        break
Upvotes: 1
Views: 97
Reputation: 9003
What about something like the following where you keep track of where you are when you find the tag:
def parseSentence(sentence):
    cfrom, cto, incfrom, incto = 0, 0, 0, 0
    place = ''  # to keep track of where we are
    for i in range(len(sentence)):
        c = sentence[i]
        if c == '<':
            # check for 'cfrom'
            if sentence[i : i + 4] == '<tag':
                cfrom = i
                place = 'botag'  # begin-open-tag
            # check for 'incfrom' (slicing avoids an IndexError on a trailing '<')
            elif sentence[i + 1 : i + 2] == '/' and place == 'intag':
                incfrom = i
                place = 'bctag'  # begin-close-tag
        elif c == '>':
            # check for 'cto'
            if place == 'botag':  # just after '<tag...'
                cto = i
                place = 'intag'  # now within the XML tag
            # check for 'incto'
            elif place == 'bctag':
                incto = i
                place = ''
                yield (cfrom, cto, incfrom, incto)
This should work correctly for all your sentences, but note that it will really only work correctly if there is only one <tag>...</> in your sentence. If there is more than one, it will return the positions of the last <tag>...</>.
Edit: If you add a yield to the function, it will iterate over the positions of all <tag>...</> tags in your sentence if you have more than one (see above).
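As an alternative to the hand-written state tracking, the same positions can probably be read off a regular expression match. This is only a sketch: it assumes every tag looks like <tag "ID">word</> (inferred from the example sentences), and the names TAG_RE and tag_positions are made up for illustration. The tuple keeps the same order and meaning as in the function above (cto is the > that ends the opening <tag ...>, incto the > that ends </>).

```python
import re

# Assumed tag shape, inferred from the example sentences: <tag "ID">word</>
TAG_RE = re.compile(r'<tag "[^"]*">(.*?)</>')

def tag_positions(sentence):
    """Yield (cfrom, cto, incfrom, incto) for every <tag "...">...</> found."""
    for m in TAG_RE.finditer(sentence):
        cfrom = m.start()       # "<" that opens <tag ...>
        cto = m.start(1) - 1    # ">" that ends <tag ...>
        incfrom = m.end(1)      # "<" that opens </>
        incto = m.end() - 1     # ">" that ends </>
        yield cfrom, cto, incfrom, incto

# Stray "<" and ">" outside the tag are ignored because they are not part of a match.
example = 'x > y <tag "510270">calculate</> z < w'
for cfrom, cto, incfrom, incto in tag_positions(example):
    print(example[cfrom:incto + 1])    # the whole tag
    print(example[cto + 1:incfrom])    # the tagged word
```

Because finditer walks over all non-overlapping matches, sentences with several tags yield one tuple per tag.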
Upvotes: 1
Reputation: 4170
If I understand this correctly, this should work (assuming you keep the loop variables i and c):
cfrom, cto = 0, 0
for i, c in enumerate(sentence1):
    if sentence1.startswith("<tag", i):  # a single character can never equal "<tag"
        cfrom = i
        break
for i, c in enumerate(sentence1[cfrom:]):
    if c == ">":
        cto = cfrom + i  # going forward from cfrom
        break
incfrom, incto = 0, 0
for i, c in enumerate(sentence1[cto:]):  # after the tag is opened, look for the start of the closing tag
    if sentence1.startswith("</", cto + i):
        incfrom = cto + i
        break
for i, c in enumerate(sentence1[incfrom:]):
    if c == ">":
        incto = incfrom + i
        break
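The four searches above could also be written with str.find, which takes a start index and so avoids the slice offsets. This is a rough sketch rather than a tested solution for your data: find_tag is a hypothetical helper name, it only looks at the first tag, and it assumes the closing tag is always written as </>.

```python
def find_tag(sentence):
    """Return (cfrom, cto, incfrom, incto) for the first tag, or None if absent."""
    cfrom = sentence.find("<tag")        # start of the opening tag
    if cfrom == -1:
        return None
    cto = sentence.find(">", cfrom)      # first ">" going forward from cfrom
    incfrom = sentence.find("</", cto)   # start of the closing tag
    incto = sentence.find(">", incfrom)  # ">" that ends the closing tag
    if -1 in (cto, incfrom, incto):      # malformed or incomplete tag
        return None
    return cfrom, cto, incfrom, incto

# Hypothetical shortened example with stray ">" and "<" around the tag.
print(find_tag('a > b <tag "1">x</> c < d'))
```

Since every search starts at the previous position, a > or < that appears before the tag (as in sentence2 and sentence3) is skipped.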
Upvotes: 0