Vikram

Reputation: 837

Output of soup.findAll() as input for further text manipulation using re module

I am trying to extract text from a webpage using BeautifulSoup and want to pass the output of soup.findAll() on to the re module for further data cleansing.

Passing a plain string variable works, but if I pass the output of soup.findAll(), it throws the following error:

Traceback (most recent call last):
  File "scrape2.py", line 18, in <module>
    url = re.search( 'http://[a-z.]*/[A-Za-z/%0-9-]*', univ)
  File "/usr/lib/python2.7/re.py", line 142, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or buffer

Printing the variable returned by soup.findAll() works. How can I pass the output of soup.findAll() directly as input to re.search?

Complete Source Code

from BeautifulSoup import BeautifulSoup
import re

# parse the locally saved copy of the page
soup = BeautifulSoup(open("rr-ss.html").read())
univ = soup.findAll('div', {'id': 'divBrand1'})

print univ
text = '<span class="normaltextblue"><a href="http://www.roya3d.com/zdae/bug/coastdfilm-coated%20tab">Rocks</a></span>&nbsp;&nbsp;&nbsp;'


# the following line throws the error above
url = re.search('http://[a-z.]*/[A-Za-z/%0-9-]*', univ)

# the following line works fine
url = re.search('http://[a-z.]*/[A-Za-z/%0-9-]*', text)

if url:
    found = url.group(0)    
    print found

Upvotes: 0

Views: 2714

Answers (3)

Paul

Reputation: 7325

I had a scraping problem where we needed to get the rendered content, or the content visible in a typical browser. In the case below the non-displayable text is nested in a style tag, and is not visible in many of the browsers I have checked. Other variations exist, such as defining a class that sets display to none and then using that class on the div.

<html>
  <title>  Title here</title>

  <body>

    lots of text here <p> <br>
    <h1> even headings </h1>

    <style type="text/css"> 
        <div > this will not be visible </div> 
    </style>


  </body>

</html>

One solution posted elsewhere is:

html = Utilities.ReadFile('simple.html')   # helper that returns the file contents as a string
soup = BeautifulSoup.BeautifulSoup(html)
texts = soup.findAll(text=True)
visible_texts = filter(visible, texts)     # visible() is a filter predicate, sketched below
print(visible_texts)


[u'\n', u'\n', u'\n\n        lots of text here ', u' ', u'\n', u' even headings ', u'\n', u' this will not be visible ', u'\n', u'\n']
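
The visible predicate used in that snippet is not shown; here is a minimal sketch of one, assuming the intent is simply to drop text nodes that live inside tags a browser does not render (along the lines of the linked answers below):

import re

def visible(element):
    # discard text whose parent is a tag that browsers do not render
    if element.parent.name in ['style', 'script', 'head', 'title', '[document]']:
        return False
    # discard HTML comments as well
    if re.match('<!--.*-->', unicode(element)):
        return False
    return True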

This solution certainly has applications in many cases and generally does the job quite well, but with the HTML posted above it retains the text that is not rendered. After searching SO, a couple of solutions came up: BeautifulSoup get_text does not strip all tags and JavaScript, and Rendered HTML to plain text using Python.

import nltk

%timeit nltk.clean_html(html)

was returning 153 us per loop

... or using html2text:

import html2text

betterHTML = html.decode(errors='ignore')
%timeit html2text.html2text(betterHTML)

3.09 ms per loop

Upvotes: 0

Steve Jessop

Reputation: 279285

findAll returns a list of HTML elements. A list is not a string, and HTML elements also are not strings, so you cannot apply a regex to them unless you first convert them to strings. So the answer to your actual question, "how to pass the output of findAll to regex.search()", is to use unicode(univ).
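
For instance, a minimal sketch of that conversion using the names from the question:

univ = soup.findAll('div', {'id': 'divBrand1'})
# unicode(univ) is a plain string, so re.search() no longer raises TypeError
url = re.search('http://[a-z.]*/[A-Za-z/%0-9-]*', unicode(univ))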

But your regex seems wrong -- aside from anything else it doesn't match the URL in your example, which has a digit in the network location.
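
For example, one possible adjustment that also allows digits in the host part (applied to the text variable from the question; the exact character classes are just an illustration):

url = re.search('http://[a-z0-9.]*/[A-Za-z/%0-9-]*', text)
if url:
    print url.group(0)   # http://www.roya3d.com/zdae/bug/coastdfilm-coated%20tab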

Furthermore, there should only be one element with a given id (that's the point of id in HTML, it's unique in the document). So findAll seems wrong anyway, unless you're intentionally allowing for broken HTML.

You should probably do something like this:

url = soup.find('div', {'id':'divBrand1'}).a['href']

You'll also have to decide how to handle the possibility that the document doesn't contain the data you're looking for. The code I have shown throws exceptions, but you could check whether None is returned from .find() or .a if you'd prefer to handle it without exceptions. Call has_key() to see whether href is present on the <a> element.
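
A sketch of the no-exceptions version, using the checks described above:

div = soup.find('div', {'id': 'divBrand1'})
if div is not None and div.a is not None and div.a.has_key('href'):
    print div.a['href']
else:
    print 'no matching link found'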

Upvotes: 0

paramiao

Reputation: 1

When you run into this problem, you can just print dir(object) and type(object) to see what you are dealing with; the findAll() result is a list, so you can simply access an element of it.
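
For example, a quick way to inspect the result (assuming univ is the list returned by findAll in the question):

print type(univ)   # a list-like ResultSet
print dir(univ)    # the usual list methods are available
print univ[0]      # the first matched <div> element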

By the way, from what you are doing, I am wondering whether you want to get the href inside a certain id? I suggest you use a CSS selector and get('href'), for example:

#get the divs
divbrands = soup.select('#divBrand1')
for divbrand in divbrands:
    #get all <a></a> tags
    links = divbrand.select('a')
    for link in links:
        #get all the href
        print link.get('href')

You can also write it in one line:

hrefs = [link.get('href') for link in soup.select('#divBrand1 a')]

Upvotes: 0
