Reputation: 1819
I'm crawling through a simple, but long HTML chunk, which is similar to this:
<table>
<tbody>
<tr>
<td> Some text </td>
<td> Some text </td>
</tr>
<tr>
<td> Some text
<br/>
Some more text
</td>
</tr>
</tbody>
</table>
I'm collecting the data with the following little Python code (using lxml):
for element in root.iter():
    # compare the tag name, not the element object itself
    if element.tag == 'td':
        print(element.text)
Some of the texts are split across two lines, but most of them fit on a single line. The problem is with the split ones.
The root element is the 'table' tag. That little code prints all the other texts, but not what comes after the 'br' tags. If I don't exclude non-td tags, the code tries to print the text inside the 'br' tags, but of course there's nothing in there, so it just prints an empty line.
After the 'br', the iteration moves on to the next tag and ignores the data that is still inside the previous 'td' tag.
How can I also get the data after those tags?
Edit: It seems that some of the 'br' tags are self-closing, but some are left open:
<td>
Some text
<br>
Some more text
</td>
The element.tail attribute, suggested in the first answer, does not seem to be able to get the data after that open tag.
Edit2: Actually, it works; it was my own mistake. I forgot to mention that the part that prints element.text was wrapped in a try-except, which in the case of the br tag caught an AttributeError, because there's nothing inside the br tags. I had set the handler to just pass and print nothing. I also tried to print the tail inside the same try-except, but that line was never reached because of the exception raised before it.
Upvotes: 1
Views: 4669
Reputation: 1
You can target the br element and use .get(index) to fetch the underlying DOM element, then use nextSibling to target the text node. The nodeValue property can then be used to get the text.
Upvotes: -1
Reputation: 5302
For me, the expression below works to extract the text after the br:
normalize-space(//table//br/following::text()[1])
Working example is at.
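As a rough sketch of how that expression could be run from lxml (the sample markup and the variable names are my own assumptions, modelled on the question's snippet rather than taken from the real page):

```python
from lxml import etree

# Hypothetical sample markup, modelled on the snippet in the question.
content = """
<table>
<tr><td> Some text </td></tr>
<tr><td> Some text
<br>
Some more text
</td></tr>
</table>
"""

root = etree.HTML(content)

# normalize-space() turns the node-set into a string using only its first
# node, so this returns the text after the first <br> in the table.
first_after_br = root.xpath("normalize-space(//table//br/following::text()[1])")
print(first_after_br)

# Without normalize-space(), xpath() returns a list with the first text
# node following each <br>, which can be stripped in Python instead.
all_after_br = [t.strip() for t in root.xpath("//table//br/following::text()[1]")]
print(all_after_br)
```

If the table has several br tags, the second form is probably the one you want, since the first only ever gives you a single string.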
Upvotes: 1
Reputation: 11971
Because <br/> is a self-closing tag, it does not have any text content. Instead, you need to access its tail content. The tail content is the text that comes after the element's closing tag, but before the next tag. To access this content in your for loop you will need to use the following:
for element in root.iter():
    element_text = element.text
    element_tail = element.tail
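Both attributes are None when there is nothing there, so it may help to guard against that instead of relying on a try-except. A minimal sketch, assuming a one-line sample string and a helper name (collect_fragments) that I made up for illustration:

```python
from lxml import etree

# Hypothetical one-line sample; the real page will be longer.
content = "<table><tr><td> Some text <br/> Some more text </td></tr></table>"
root = etree.HTML(content)

def collect_fragments(root):
    """Collect every non-blank text and tail found while iterating."""
    fragments = []
    for element in root.iter():
        # .text and .tail are None when absent, so test before stripping
        for piece in (element.text, element.tail):
            if piece and piece.strip():
                fragments.append(piece.strip())
    return fragments

fragments = collect_fragments(root)
print(fragments)
```

Note that with more deeply nested markup an element's tail is appended before the text of its children, so the list is not strictly in document order; for this flat td/br layout that does not matter.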
Even if the br tag is an opening tag, this method will still work:
from lxml import etree
content = '''
<table>
<tbody>
<tr>
<td> Some text </td>
<td> Some text </td>
</tr>
<tr>
<td> Some text
<br>
Some more text
</td>
</tr>
</tbody>
</table>
'''
root = etree.HTML(content)
for element in root.iter():
    # skip missing (None) and whitespace-only tails
    if element.tail and element.tail.strip():
        print(element.tail.strip())
Output
Some more text
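If what you actually want is the whole content of each td in one piece, lxml's itertext() may be simpler, since it walks the element's own text plus the text and tail of everything inside it. A small sketch, assuming the same sample markup as in the question (the cells list and the whitespace clean-up are my own additions):

```python
from lxml import etree

# Sample markup as in the question, with an unclosed <br>.
content = """
<table>
<tbody>
<tr>
<td> Some text </td>
<td> Some text </td>
</tr>
<tr>
<td> Some text
<br>
Some more text
</td>
</tr>
</tbody>
</table>
"""

root = etree.HTML(content)

cells = []
for td in root.iter("td"):
    # itertext() yields td.text, then the text and tail of each descendant,
    # in document order; the td's own tail is not included.
    raw = "".join(td.itertext())
    # collapse newlines and repeated spaces into single spaces
    cells.append(" ".join(raw.split()))

print(cells)
```

This way the two halves around the br end up in the same string, which is usually easier to work with than separate text and tail values.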
Upvotes: 4