Reputation: 13
I am trying to extract specific text from a website with the following HTML:
...
<tr>
<td>
<strong>
Location:
</strong>
</td>
<td colspan="3">
90 km S. of Prince Rupert
</td>
</tr>
...
I want to extract the text that comes after "Location:" (i.e. "90 km S. of Prince Rupert"). There are many similar pages that I want to loop through, grabbing the text that follows "Location:" on each one.
I am quite new to Python and haven't been able to find a way to extract text based on a condition like this.
Upvotes: 1
Views: 1182
Reputation: 4940
My understanding is that BeautifulSoup (BS) does not handle malformed HTML as well as lxml. I could be wrong, but I have generally used lxml for this type of problem. Here is some code you can experiment with to get a feel for working with the elements. There are lots of approaches.
The best place to get lxml in my opinion is here
from lxml import html
ms = '''<tr>
<td>
<strong>
Location:
</strong>
</td>
<td colspan="3">
90 km S. of Prince Rupert
</td>
<mytag>
Hello World
</mytag>
</tr>'''
mytree = html.fromstring(ms)  # this creates a 'tree' in memory
for e in mytree.iter():  # iterate through the elements
    if e.tag == 'td':  # focus on the elements that are td elements
        if 'location' in e.text_content().lower():  # if location is in the text of a td
            for sib in e.itersiblings():  # find all the following siblings of the td
                sib.text_content()  # echoed in an interactive session; use print() in a script
'\n 90 km S. of Prince Rupert\n'
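Since the question is about looping over many pages, the same sibling lookup could be wrapped in a small function. This is only a sketch: `find_location` and the `pages` list are made-up names for illustration, and it assumes each page's source has already been fetched into a string and that the value always sits in the first td after the "Location:" cell.

```python
from lxml import html


def find_location(page_source):
    """Return the text of the first <td> following the 'Location:' cell, or None."""
    tree = html.fromstring(page_source)
    for td in tree.iter('td'):  # visit only the td elements
        if 'location:' in td.text_content().lower():
            # the first following sibling that is a td holds the value
            for sib in td.itersiblings(tag='td'):
                return sib.text_content().strip()
    return None  # no 'Location:' cell on this page


# Hypothetical stand-ins for page sources that were downloaded elsewhere
pages = [
    '<table><tr><td><strong>Location:</strong></td>'
    '<td colspan="3">90 km S. of Prince Rupert</td></tr></table>',
    '<table><tr><td>No such label here</td></tr></table>',
]

locations = [find_location(p) for p in pages]
print(locations)
```

Returning None for pages without the label keeps the loop from failing on a page that is laid out differently.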
There is a lot to learn here, but lxml is pretty introspective:
>>> help(e.itersiblings)
Help on built-in function itersiblings:

itersiblings(...)
    itersiblings(self, tag=None, preceding=False)

    Iterate over the following or preceding siblings of this element.

    The direction is determined by the 'preceding' keyword which
    defaults to False, i.e. forward iteration over the following
    siblings. When True, the iterator yields the preceding
    siblings in reverse document order, i.e. starting right before
    the current element and going left. The generated elements
    can be restricted to a specific tag name with the 'tag'
    keyword.
Note: I changed the string a little and added mytag, so here is the new code based on the help for itersiblings:
for e in mytree.iter():
    if e.tag == 'td':
        if 'location' in e.text_content().lower():
            for sib in e.itersiblings(tag='mytag'):  # only siblings that are mytag elements
                sib.text_content()
'\n Hello World\n'
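As one of the other possible approaches, the same lookup can be written as a single XPath query instead of nested loops. This is a rough sketch, not the method used above: the sample string wraps the row in a table element as an assumption about the real page, and the expression assumes the label cell contains nothing but "Location:".

```python
from lxml import html

# Assumed sample: the row from the question wrapped in a <table>
page_source = '''<table><tr>
<td><strong>Location:</strong></td>
<td colspan="3">
90 km S. of Prince Rupert
</td>
</tr></table>'''

tree = html.fromstring(page_source)

# Pick the td whose whitespace-normalized text is exactly 'Location:',
# then take the first td sibling that follows it.
cells = tree.xpath("//td[normalize-space(.)='Location:']/following-sibling::td[1]")

# xpath() returns a list, which is empty when nothing matches
location = cells[0].text_content().strip() if cells else None
print(location)
```

The `normalize-space()` call collapses the newlines around the label, so the match does not depend on how the HTML happens to be indented.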
Upvotes: 2