Ciaran

Reputation: 1149

Search HTML for text in Python

I am using urllib2 to fetch a web page, and I need to look for a specific value within the returned data.

Is it better to do this with Beautiful Soup and its find method, or with a regex that searches the data?

Here is a very basic example of the text that is returned by the request:

<html>
<body>
<table> 
   <tbody> 
      <tr>
         <td>
            <div id="123" class="services">
               <table>
                  <tbody>
                     <tr>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> Example BLAB BLAB BLAB </td>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
                     </tr>

                     <tr>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
                     </tr>

                     <tr>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
                     </tr>
                  </tbody>
               </table>
            </div>
         </td>
      </tr>
   </tbody>
</body>
</html>

In this case I want to return "Example BLAB BLAB BLAB". The only part that stays constant is "Example", and I want to return all of the text within the tag that contains it.

Upvotes: 0

Views: 90

Answers (1)

falsetru

Reputation: 368894

Don't use regular expressions to parse HTML/XML.
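
As a rough illustration of why a regex is fragile here, the sketch below uses a hypothetical pattern that matches a cell only when `class` is its sole attribute. The two sample strings are made up for the demonstration; the second adds a `bgcolor` attribute, as in the question's markup, and the pattern no longer matches.

```python
import re

# Hypothetical pattern: assumes class="style8" is the only attribute on the tag.
pattern = re.compile(r'<td class="style8">(.*?)</td>')

# Two made-up cells with the same text; only the attributes differ.
plain_cell = '<td class="style8"> Example BLAB </td>'
extra_attr_cell = '<td bgcolor="ffffff" class="style8"> Example BLAB </td>'

matches_plain = pattern.findall(plain_cell)
matches_extra = pattern.findall(extra_attr_cell)

print(matches_plain)  # the cell text is captured
print(matches_extra)  # empty list: the extra attribute breaks the match
```

An HTML parser looks at the element and its attributes rather than the exact character sequence, so attribute order and spacing stop mattering.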

With BeautifulSoup, you can use a CSS selector:

>>> from bs4 import BeautifulSoup
>>>
>>> html_str = '''
... <html>
... <body>
... <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> Example BLAB BLAB BLAB </td>
... <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
... <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
... <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
... </body>
... </html>
... '''
>>> soup = BeautifulSoup(html_str, 'html.parser')
>>> for td in soup.select('.style8'):
...     print(td.text)
...
 Example BLAB BLAB BLAB
 BLAB BLAB BLAB
 BLAB BLAB BLAB
 BLAB BLAB BLAB
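
To return only the cell that contains "Example", as the question asks, one possible approach is to filter the selected cells by their text. This is a minimal sketch, not the only way to do it: `find_cell_text` is a hypothetical helper name, the markup is a trimmed version of the question's sample, and it assumes the built-in `html.parser` backend is acceptable.

```python
from bs4 import BeautifulSoup

# Trimmed version of the question's markup (assumed structure).
html_str = '''
<table><tbody><tr>
<td style="PADDING-LEFT: 5px" bgcolor="ffffff" class="style8"> Example BLAB BLAB BLAB </td>
<td style="PADDING-LEFT: 5px" bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
</tr></tbody></table>
'''

soup = BeautifulSoup(html_str, 'html.parser')


def find_cell_text(soup, keyword):
    """Return the stripped text of the first td.style8 containing keyword, else None."""
    for td in soup.select('td.style8'):
        text = td.get_text(strip=True)
        if keyword in text:
            return text
    return None


result = find_cell_text(soup, 'Example')
print(result)
```

If several cells can contain the keyword, collecting the matches in a list instead of returning the first one would be a small change.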

Upvotes: 5
