alvas

Reputation: 122052

How to find the character position of something specified with a <tag>...</>? Python

I am trying to get the positions of the < and > characters that belong to a real tag when they are embedded in something like this: <tag "510270">calculate</>.

I have sentences like these:

sentence1 = ("After six weeks and seventeen tentative approaches the only serious "
"tender came from Daniel. He had offered a paltry #2 a week for the one-time "
"woodman's home, sane enough in this, at least, to <tag \"510270\">calculate</> "
"safety to the nearest new penny piece. ")

sentence2 = ("After six weeks and seventeen tentative approaches the only serious "
"tender came from Daniel. He had offered a paltry #2 a week for the one-time "
"woodman's < home, sane enough in this, at least, to <tag \"510270\">calculate</> "
"safety to the nearest new penny > piece. ")

sentence3 = ("After six weeks and seventeen tentative approaches the only serious "
"tender came from Daniel. He had offered a paltry #2 a week for the one-time "
"woodman's > home, sane enough in this, at least, to <tag \"510270\">calculate</> "
"safety to the nearest new penny < piece. ")

I need cfrom and incfrom to be the positions of the 1st and 2nd < within the <tag "XXXX">...</>, and I need cto and incto to be the positions of the 2nd and 1st > within the <tag "XXXX">...</>.

How could I also do this for sentences like sentence2 and sentence3, where a < or > occurs outside of the <tag "XXXX">...</>?

For sentence1, I could simply do this:

cfrom, cto = 0, 0
# first "<" in the sentence, scanning from the left
for i, c in enumerate(sentence1):
  if c == "<":
    cfrom = i
    break

# last ">" in the sentence, scanning from the right
for i, c in enumerate(reversed(sentence1)):
  if c == ">":
    cto = len(sentence1) - 1 - i
    break

incfrom, incto = 0, 0
# first ">" at or after cfrom
for i, c in enumerate(sentence1[cfrom:cto]):
  if c == ">":
    incfrom = cfrom + i
    break

# first "<" between incfrom and cto
for i, c in enumerate(sentence1[incfrom:cto]):
  if c == "<":
    incto = incfrom + i
    break

Upvotes: 1

Views: 97

Answers (2)

Mike Webb

Reputation: 9003

What about something like the following where you keep track of where you are when you find the tag:

def parseSentence(sentence):
    cfrom, cto, incfrom, incto = 0, 0, 0, 0
    place = '' #to keep track of where we are

    for i, c in enumerate(sentence):
        if (c == '<'):
            #check for 'cfrom'
            if (sentence[i : i + 4] == '<tag'):
                cfrom = i
                place = 'botag' #begin-open-tag
            #check for 'incfrom'
            elif (sentence[i + 1 : i + 2] == '/' and place == 'intag'): #slice avoids an IndexError when '<' is the last character
                incfrom = i
                place = 'bctag' #begin-close-tag
        elif (c == '>'):
            #check for 'cto'
            if (place == 'botag'): #just after '<tag...'
                cto = i
                place = 'intag' #now within the XML tag
            #check for 'incto'
            elif (place == 'bctag'):
                incto = i
                place = ''
                yield (cfrom, cto, incfrom, incto)

This should work correctly for all your sentences. Because of the yield, the function is a generator: iterating over parseSentence(sentence) gives one (cfrom, cto, incfrom, incto) tuple for each <tag>...</> in the sentence, so it also handles sentences that contain more than one.

Edit: the yield was added later. Without it, returning the four variables at the end of the function, you would only get the positions of the last <tag>...</> when there is more than one.
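As a cross-check on the state-tracking idea above, here is a minimal sketch of a different technique: a regular expression with re.finditer. It assumes every real tag has the exact shape <tag "ID">text</> shown in the question; the names TAG_RE and tag_positions are made up for this illustration and are not part of the answer above.

```python
import re

# Assumption: a real tag always looks like <tag "ID">text</>.
# TAG_RE and tag_positions are illustrative names only.
TAG_RE = re.compile(r'<tag\s+"[^"]*">(.*?)</>', re.DOTALL)


def tag_positions(sentence):
    """Yield (open_lt, open_gt, close_lt, close_gt) for each tagged span."""
    for m in TAG_RE.finditer(sentence):
        open_lt = m.start()        # '<' that starts the opening tag
        open_gt = m.start(1) - 1   # '>' that ends the opening tag
        close_lt = m.end(1)        # '<' that starts the closing '</>'
        close_gt = m.end() - 1     # '>' that ends the closing '</>'
        yield open_lt, open_gt, close_lt, close_gt


if __name__ == "__main__":
    s = 'a < b <tag "510270">calculate</> c > d'
    for open_lt, open_gt, close_lt, close_gt in tag_positions(s):
        # The tagged word sits between the first '>' and the second '<'.
        print(open_lt, open_gt, close_lt, close_gt, s[open_gt + 1:close_lt])
```

Stray < or > characters outside the tag are skipped because the pattern only matches at a literal <tag. Map the four values onto cfrom, cto, incfrom and incto in whichever order you need.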

Upvotes: 1

stackErr

Reputation: 4170

If I understand this correctly, this should work (assuming you don't change the variables i, c):

cfrom, cto = 0, 0
for i, c in enumerate(sentence1):
  if sentence1[i:i + 4] == "<tag":  # c is a single character, so compare a slice
    cfrom = i
    break

for i, c in enumerate(sentence1[cfrom:]):
  if c == ">":
    cto = cfrom + i  # going forward from cfrom
    break

incfrom, incto = 0, 0
for i, c in enumerate(sentence1[cto:]):  # after the tag is opened, look for the start of the closing tag
  if sentence1[cto + i:cto + i + 2] == "</":
    incfrom = cto + i
    break

for i, c in enumerate(sentence1[incfrom:]):
  if c == ">":
    incto = incfrom + i
    break
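The same left-to-right search can probably be written more compactly with str.find, which takes an optional start index. This is only a sketch: the function name find_tag is hypothetical, and the variable names follow the meaning used in this answer (cfrom and cto delimit the opening tag, incfrom and incto delimit the closing </>).

```python
def find_tag(sentence, start=0):
    """Return (cfrom, cto, incfrom, incto) for the first tag, or None.

    find_tag is an illustrative helper, not an existing library function.
    """
    cfrom = sentence.find("<tag", start)     # '<' of the opening tag
    if cfrom == -1:
        return None
    cto = sentence.find(">", cfrom)          # going forward from cfrom
    if cto == -1:
        return None
    incfrom = sentence.find("</", cto)       # start of the closing tag
    if incfrom == -1:
        return None
    incto = sentence.find(">", incfrom)      # '>' of the closing tag
    if incto == -1:
        return None
    return cfrom, cto, incfrom, incto


if __name__ == "__main__":
    s = 'x > y <tag "1">word</> z < w'
    print(find_tag(s))
```

Each search starts where the previous one stopped, so a < or > that appears before the <tag is never considered.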

Upvotes: 0
