How to extract text from HTML conditionally in BeautifulSoup

Question

I am trying to extract specific text from a website with the following HTML:

...
<tr>
 <td>
  <b>
   Location:
  </b>
 </td>
 <td>
  90 km S. of Prince Rupert
 </td>
</tr>
...

I want to extract the text that comes after "Location:" (i.e. "90 km S. of Prince Rupert"). There are a whole load of similar websites that I want to loop through, grabbing the text that follows "Location:" on each.

I am quite new to Python and haven't been able to find a way to extract text based on a condition like this.

PyNEwbie · Accepted Answer

My understanding is that BeautifulSoup does not handle malformed HTML as well as lxml. I could be wrong, but I have generally used lxml for this kind of problem. Here is some code you can experiment with to get a feel for working with the elements. There are lots of approaches.

The best place to get lxml in my opinion is here

from lxml import html

ms = '''<tr>
            <td>
             <b>
              Location:
             </b>
            </td>
            <td>
             90 km S. of Prince Rupert
            </td>
            <mytag>
            Hello World
            </mytag>
           </tr>'''

mytree = html.fromstring(ms)  # parse the markup into a 'tree' in memory
for e in mytree.iter():       # iterate through every element in the tree
    if e.tag == 'td':         # focus on the elements that are td elements
        if 'location' in e.text_content().lower():  # this td holds the label
            for sib in e.itersiblings():  # every sibling that follows the td
                print(sib.text_content().strip())  # print the sibling's text

90 km S. of Prince Rupert
Hello World
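The same lookup can probably be written as a single XPath query, which lxml also supports. This is only a sketch: the sample markup (including the table, tr and b tags) is assumed, and location_via_xpath is a made-up helper name. Note that contains() is case-sensitive, unlike the lower() comparison in the loop.

```python
from lxml import html

# Sample row modelled on the question; the exact tags are an assumption.
SAMPLE = """
<table>
 <tr>
  <td><b>Location:</b></td>
  <td> 90 km S. of Prince Rupert </td>
 </tr>
</table>
"""


def location_via_xpath(page_source):
    """Return the text of the first <td> after the 'Location:' cell, or None."""
    tree = html.fromstring(page_source)
    # Find the <td> whose text contains the label, then step to the next <td>.
    cells = tree.xpath('//td[contains(., "Location:")]/following-sibling::td[1]')
    if not cells:
        return None
    return cells[0].text_content().strip()


print(location_via_xpath(SAMPLE))
```

Returning None when the label is missing makes it easier to skip pages that do not have a "Location:" row.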

There is a lot to learn here, but lxml is easy to introspect with Python's built-in help:

>>> help(e.itersiblings)
Help on built-in function itersiblings:

itersiblings(...)
    itersiblings(self, tag=None, preceding=False)

    Iterate over the following or preceding siblings of this element.

    The direction is determined by the 'preceding' keyword which
    defaults to False, i.e. forward iteration over the following
    siblings.  When True, the iterator yields the preceding
    siblings in reverse document order, i.e. starting right before
    the current element and going left.  The generated elements
    can be restricted to a specific tag name with the 'tag'
    keyword.
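To see what the tag and preceding keywords do in practice, here is a small sketch on a made-up row. The cell contents a to e are placeholders and have nothing to do with the question's page.

```python
from lxml import html

# A made-up row: three <td> cells and two <th> cells, labelled a..e.
table = html.fromstring(
    "<table><tr><td>a</td><td>b</td><th>c</th><td>d</td><th>e</th></tr></table>"
)
cells = table.xpath("//tr/*")   # all five cells, in document order
anchor = cells[2]               # the cell whose text is "c"

# Default: siblings after the anchor, in document order.
following = [c.text for c in anchor.itersiblings()]

# preceding=True: siblings before the anchor, nearest first.
preceding = [c.text for c in anchor.itersiblings(preceding=True)]

# tag=...: keep only siblings with that tag name.
following_th = [c.text for c in anchor.itersiblings(tag="th")]

print(following)      # ['d', 'e']
print(preceding)      # ['b', 'a']
print(following_th)   # ['e']
```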

Note: I changed the string a little and added a mytag element, so here is new code that uses the tag keyword described in the help for itersiblings:

for e in mytree.iter():
    if e.tag == 'td':
        if 'location' in e.text_content().lower():
            for sib in e.itersiblings(tag='mytag'):  # only siblings named mytag
                print(sib.text_content().strip())

Hello World
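Since the question asked about BeautifulSoup and about looping over many pages, here is a rough equivalent using bs4 with the standard-library "html.parser" backend. It is a sketch under assumptions: extract_location and PAGES are made-up names, the page sources are inline strings standing in for whatever you download, and the table/tr/b tags are guessed.

```python
import re

from bs4 import BeautifulSoup


def extract_location(page_source):
    """Return the text of the <td> that follows the 'Location:' cell, or None."""
    soup = BeautifulSoup(page_source, "html.parser")
    # Find the text node that holds the label, wherever it is nested.
    label = soup.find(string=re.compile(r"Location:"))
    if label is None:
        return None
    # Climb to the enclosing <td>, then move to the next <td> sibling.
    label_cell = label.find_parent("td")
    if label_cell is None:
        return None
    value_cell = label_cell.find_next_sibling("td")
    if value_cell is None:
        return None
    return value_cell.get_text(strip=True)


# Hypothetical stand-ins for downloaded pages; page_2 has no Location row.
PAGES = {
    "page_1": "<table><tr><td><b>Location:</b></td>"
              "<td> 90 km S. of Prince Rupert </td></tr></table>",
    "page_2": "<table><tr><td><b>Name:</b></td><td>n/a</td></tr></table>",
}

results = {name: extract_location(source) for name, source in PAGES.items()}
print(results)
```

In real use the values in PAGES would come from fetching each URL; the extraction step stays the same.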
