Dejwi

Reputation: 4487

Python get a HTML element/node/tag from exact position

I've got a long html document and I know the exact position of some text within it. For example:

<html>
  <body>
    <div>
      <a>
        <b>
          I know the exact position of this text
        </b>
        <i>
          Another text
        </i>
      </a>
    </div>
  </body>
</html>

I know that the sentence "I know the exact position of this text" starts at character number 'x' and ends at character number 'y'. But I need to get the whole tag/node/element that holds this text, and possibly several of its ancestors.

How can I easily handle it?

//edit

To state it clearly - the only thing I've got is an integer value, which describes the start of the sentence.

For example - 2048.

I cannot assume anything about the structure of the document. Starting from that position, I have to walk up through the nodes, ancestor by ancestor.

Even the sentence at the given position (2048) does not have to be unique.

Upvotes: 0

Views: 2045

Answers (2)

pepr

Reputation: 20762

You can read the content of the whole HTML document as a string. Then insert a marker (an HTML anchor element with a unique id) at the known position and parse the modified string using xml.etree.ElementTree, as if the marker had been in the original document. Then you can find the parent element of the marker using XPath and remove the auxiliary marker. The result has the same structure as if the original document had been parsed, but now you have a reference to the element that holds the text. Note that xml.etree.ElementTree is an XML parser, so this works only when the HTML is also well-formed XML, as in your example.

Warning: You have to know whether the position is a byte position or an abstract character position. (Think about multibyte encodings, where some characters are encoded by sequences of varying length. Think also about line endings -- one or two bytes.)
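
If the known position turns out to be a byte offset, it has to be converted before slicing the decoded string. The following is only a minimal sketch: the helper name byte_to_char_offset and the sample bytes are made up for illustration, and it assumes the offset falls on a character boundary (otherwise decoding the prefix raises UnicodeDecodeError). Also keep in mind that opening a file in text mode with the default newline setting translates "\r\n" to "\n", which shifts character positions; passing newline='' to open() keeps the line endings untouched.

```python
def byte_to_char_offset(raw: bytes, byte_pos: int, encoding: str = "utf-8") -> int:
    """Return the character index that corresponds to byte_pos in raw.

    Hypothetical helper: assumes byte_pos lies on a character boundary.
    """
    # Decoding only the prefix counts the characters that precede the position.
    return len(raw[:byte_pos].decode(encoding))


# Made-up sample with multibyte characters and a Windows line ending.
raw = "<b>zażółć</b>\r\n<i>x</i>".encode("utf-8")
byte_pos = raw.index(b"<i>")            # position measured in bytes
char_pos = byte_to_char_offset(raw, byte_pos)

text = raw.decode("utf-8")
print(byte_pos, char_pos, text[char_pos:char_pos + 3])
```

The byte offset and the character offset differ here because four of the letters take two bytes each in UTF-8.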

Try the following example, where the document from your question was stored in doc.html using Windows line endings:

#!python3

import xml.etree.ElementTree as ET

fname = 'doc.html'
pos = 64

with open(fname, encoding='utf-8') as f:
    content = f.read()

# The position_id will be used in XPath, the position_anchor
# uses the variable only for readability. The position anchor
# has the form of an HTML element to be found easily using 
# the XPath expression.
position_id = 'my_unique_position_{}'.format(pos)
position_anchor = '<a id="{}" />'.format(position_id)

# The modified content has one extra anchor as the position marker.
modified_content = content[:pos] + position_anchor + content[pos:]

root = ET.fromstring(modified_content)
ET.dump(root)
print('----------------')

# Now some examples for getting the info around the point.
# '.' = from here; '//' = wherever; 'a[@id=...]' = anchor (a) element
# with the attribute id with the value. 
# We will not use it later -- only for demonstration.
anchor_element = root.find('.//a[@id="{}"]'.format(position_id))
ET.dump(anchor_element)
print('----------------')

# The text at the original position -- the text became the tail 
# of the element.
print(repr(anchor_element.tail))
print('================')

# Now, from scratch, get the nearest parent from the position.
parent = root.find('.//a[@id="{}"]/..'.format(position_id))
ET.dump(parent)
print('----------------')

# ... and the anchor element (again) as the nearest child
# with the attributes.
anchor = parent.find('./a[@id="{}"]'.format(position_id))
ET.dump(anchor)
print('----------------')

# If the marker split the text, part of the text belongs to 
# the parent, part is the tail of the anchor marker.
print(repr(parent.text))
print(repr(anchor.tail))
print('----------------')

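# Modify the parent to remove the anchor element (to get
# the original structure without the marker). Do not forget
# that the text became part of the marker element as its tail.
# This assumes the marker is the first child of the parent;
# the 'or' guards cover a missing text or tail (None).
parent.remove(anchor)
parent.text = (parent.text or '') + (anchor.tail or '')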
ET.dump(parent)
print('----------------')

# The structure of the whole document now does not contain 
# the added anchor marker, and you get the reference
# to the nearest parent.
ET.dump(root)
print('----------------')

It prints the following:

c:\_Python\Dejwi\so25370255>a.py
<html>
  <body>
    <div>
      <a>
        <b>
          I know<a id="my_unique_position_64" /> the exact position of this text

        </b>
        <i>
          Another text
        </i>
      </a>
    </div>
  </body>
</html>
----------------
<a id="my_unique_position_64" /> the exact position of this text

----------------
' the exact position of this text\n        '
================
<b>
          I know<a id="my_unique_position_64" /> the exact position of this text

        </b>

----------------
<a id="my_unique_position_64" /> the exact position of this text

----------------
'\n          I know'
' the exact position of this text\n        '
----------------
<b>
          I know the exact position of this text
        </b>

----------------
<html>
  <body>
    <div>
      <a>
        <b>
          I know the exact position of this text
        </b>
        <i>
          Another text
        </i>
      </a>
    </div>
  </body>
</html>
----------------
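
The question also asks for several ancestors, not only the nearest parent. ElementTree elements keep no reference to their parent, so one possible approach is to build a child-to-parent map once and walk it upwards. This is a sketch under assumptions: the function name ancestors_at, the marker id, and the one-line sample document are hypothetical, the input must be well-formed XML, and the position must fall inside text (not inside a tag), otherwise parsing fails.

```python
import xml.etree.ElementTree as ET


def ancestors_at(content, pos, marker_id="hypothetical_marker"):
    """Return the elements enclosing character offset pos, nearest first."""
    marker_xml = '<a id="{}" />'.format(marker_id)
    root = ET.fromstring(content[:pos] + marker_xml + content[pos:])

    # Elements have no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    marker = root.find('.//a[@id="{}"]'.format(marker_id))
    chain = []
    node = parent_of.get(marker)
    while node is not None:
        chain.append(node)
        node = parent_of.get(node)

    # Remove the marker and give its tail back to the text it was split from:
    # the parent's text if the marker is the first child, otherwise the tail
    # of the preceding sibling.
    parent = chain[0]
    index = list(parent).index(marker)
    tail = marker.tail or ""
    if index == 0:
        parent.text = (parent.text or "") + tail
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(marker)
    return chain


doc = ("<html><body><div><a><b>I know the exact position</b>"
       "<i>Another text</i></a></div></body></html>")
pos = doc.index("exact")
chain = ancestors_at(doc, pos)
tags = [element.tag for element in chain]
print(tags)
```

The first item of the returned list is the element that directly holds the text; slicing the list gives as many ancestors as needed.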

Upvotes: 0

Chrispresso

Reputation: 4071

Assuming that <b> is unique in this instance, you can use XPath with xml.etree.ElementTree.

import xml.etree.ElementTree as ET
tree = ET.parse('xmlfile')
root = tree.getroot()
myEle = root.findall(".//*[b]")

myEle will now hold a list containing the parent of 'b', which in this case is 'a'.

If you just want the b element, then you can do this:

myEle = root.findall(".//b")

If you want the children of a you can do a couple different things:

myEle = root.findall(".//a/*")
myEle = root.findall('.//*[a]//*')[1:]
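
To check what such expressions select, they can be run against a small inline document. This is just an illustrative sketch: the shortened sample string and the variable names are assumptions, and it only uses the XPath subset that xml.etree.ElementTree supports.

```python
import xml.etree.ElementTree as ET

# Shortened, made-up version of the document from the question.
doc = "<html><body><div><a><b>first</b><i>second</i></a></div></body></html>"
root = ET.fromstring(doc)

# findall() always returns a list, even for a single match.
parents_of_b = root.findall(".//*[b]")       # elements with a direct child <b>
b_elements = root.findall(".//b")            # every <b> below the root
children_of_a = root.findall(".//a/*")       # direct children of <a>
descendants_of_a = root.findall(".//a//*")   # all descendants of <a>

for name, found in [("parents_of_b", parents_of_b),
                    ("b_elements", b_elements),
                    ("children_of_a", children_of_a),
                    ("descendants_of_a", descendants_of_a)]:
    print(name, [element.tag for element in found])
```

In this sample the children and the descendants of 'a' coincide, because 'b' and 'i' have no child elements of their own.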

For more information on XPath take a look here: XPath

Upvotes: 1
