zaplec

Reputation: 1819

How to get the text in a cell after a <br/> tag?

I'm crawling through a simple, but long HTML chunk, which is similar to this:

<table>
  <tbody>
    <tr>
      <td> Some text </td>
      <td> Some text </td>
    </tr>
    <tr>
      <td> Some text 
        <br/>
           Some more text
      </td>
    </tr>
  </tbody>
</table>

I'm collecting the data with the following small Python code (using lxml):

for element in root.iter():
  # Compare the tag name, not the element object itself
  if element.tag == 'td':
    print element.text

Some of the texts are divided into two rows, but mostly they fit in a single row. The problem is within the divided rows.

The root element is the 'table' tag. That little code prints all the other texts, but not what comes after the 'br' tags. If I don't exclude non-td tags, the code tries to print possible text from inside the 'br' tags, but of course there's nothing in there, so it just prints an empty line.

However, after this 'br', the iteration moves on to the next tag but ignores the data that's still inside the previous 'td' tag.

How can I also get the data after those tags?

Edit: It seems that some of the 'br' tags are self-closing, but some are left open:

<td> 
     Some text
  <br>
     Some more text
</td>

The element.tail attribute, suggested in the first answer, does not seem to be able to get the data after that open tag.

Edit2: Actually, it works; it was my own mistake. I forgot to mention that the "print element.text" part was wrapped in a try-except, which in the case of the br tag caught an AttributeError, because there's nothing inside the br tags. I had set the exception handler to just pass and print nothing. Inside the same try-except I also tried to print the tail, but that line was never reached because of the exception raised before it.
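For anyone hitting the same trap: .text and .tail are None when there is nothing there, so calling a string method on them raises AttributeError. Below is a minimal sketch of testing for None instead of wrapping everything in try-except. It uses the standard library's xml.etree.ElementTree as a stand-in (lxml.etree exposes the same .text and .tail attributes) and a shortened, well-formed version of the sample above. SAMPLE and collect_text are names made up for this illustration.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in; lxml.etree has the same .text/.tail attributes

# Shortened version of the sample table (self-closing <br/>, so it parses as XML)
SAMPLE = "<table><tbody><tr><td> Some text <br/> Some more text </td></tr></tbody></table>"

def collect_text(root):
    """Return the stripped, non-empty .text and .tail strings found while iterating."""
    parts = []
    for element in root.iter():
        # Both attributes may be None, so check before calling .strip()
        for chunk in (element.text, element.tail):
            if chunk and chunk.strip():
                parts.append(chunk.strip())
    return parts

root = ET.fromstring(SAMPLE)
print(collect_text(root))
```

Because the br's text is simply skipped when it is None, its tail is still reached and printed.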

Upvotes: 1

Views: 4669

Answers (3)

Harshita Jain

Reputation: 1

You can target the br element and use .get(index) to fetch the underlying DOM element, then use nextSibling to target the text node. The nodeValue property can then be used to get the text.

Upvotes: -1

Learner

Reputation: 5302

For me, the following XPath expression works to extract the text after a br:

normalize-space(//table//br/following::text()[1])

Working example is at.
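As a rough sketch of how this expression could be run from lxml (the sample markup and variable names below are made up for illustration): in XPath 1.0, normalize-space() applied to a node-set uses only the first node in document order, so with several br tags it returns just the first match. Dropping normalize-space() and stripping in Python gives one string per br.

```python
from lxml import etree

# Made-up sample with two br tags, one self-closing and one left open
html = "<table><tr><td>A<br/>B</td><td>C<br>D</td></tr></table>"
root = etree.HTML(html)

# normalize-space() converts the node-set to a string using its first node only
first_only = root.xpath("normalize-space(//table//br/following::text()[1])")

# Without normalize-space(), the [1] predicate is applied per br element,
# giving the first text node that follows each br
after_each_br = [t.strip() for t in root.xpath("//table//br/following::text()[1]")]

print(first_only)
print(after_each_br)
```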

Upvotes: 1

gtlambert

Reputation: 11971

Because <br/> is a self-closing tag, it does not have any text content. Instead, you need to access its tail content. The tail content is the text after the element's closing tag, but before the next tag. To access this content in your for loop you will need to use the following:

for element in root.iter():
    element_text = element.text
    element_tail = element.tail

Even if the br tag is left open (<br> rather than <br/>), this method will still work:

from lxml import etree

content = '''
<table>
  <tbody>
    <tr>
      <td> Some text </td>
      <td> Some text </td>
    </tr>
    <tr>
      <td> Some text 
        <br>
           Some more text
      </td>
    </tr>
  </tbody>
</table>
'''

root = etree.HTML(content)

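for element in root.iter():
    # Skip elements whose tail is missing or only whitespace
    if element.tail and element.tail.strip():
        print(element.tail.strip())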

Output

Some more text
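If what you want is the complete text of each cell rather than the individual tails, one option is itertext(), which walks an element's own text plus the text and tail of everything inside it. This is only a sketch on the same sample; cell_texts is a hypothetical helper name, not part of lxml.

```python
from lxml import etree

# Same sample markup as above (note the unclosed <br>)
content = '''
<table>
  <tbody>
    <tr>
      <td> Some text </td>
      <td> Some text </td>
    </tr>
    <tr>
      <td> Some text
        <br>
           Some more text
      </td>
    </tr>
  </tbody>
</table>
'''

def cell_texts(root):
    """Return one whitespace-normalised string per td (hypothetical helper)."""
    texts = []
    for td in root.iter('td'):
        # itertext() yields td.text, then the text and tail of every descendant
        raw = ''.join(td.itertext())
        # Collapse runs of whitespace, including the newlines around the br
        texts.append(' '.join(raw.split()))
    return texts

root = etree.HTML(content)
print(cell_texts(root))
```

For the sample this should give 'Some text', 'Some text' and 'Some text Some more text', so the part after the br stays attached to its cell.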

Upvotes: 4
