Abhijeet Mohanty

Reputation: 344

Scraping visible text

I am an absolute newbie to web scraping, and right now I want to extract the visible text from a web page. I found this piece of code online:

import urllib2
from bs4 import BeautifulSoup

url = "http://www.espncricinfo.com/"
web_page = urllib2.urlopen(url)

soup = BeautifulSoup(url , "lxml")
print (soup.prettify())

Running the above code gives the following result:

    /usr/local/lib/python2.7/site-packages/bs4/__init__.py:282: UserWarning: "http://www.espncricinfo.com/" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
  ' that document to Beautiful Soup.' % decoded_markup
<html>
 <body>
  <p>
   http://www.espncricinfo.com/
  </p>
 </body>
</html>

Is there any way I could get a more concrete result, and what is going wrong with the code? Sorry for being clueless.

Upvotes: 0

Views: 1076

Answers (2)

宏杰李

Reputation: 12168

soup = BeautifulSoup(web_page, "lxml")

You should pass a file-like object to BeautifulSoup, not the URL.

The URL is fetched by urllib2.urlopen(url), and the resulting response is stored in web_page.
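
To illustrate the difference without hitting the network, here is a minimal sketch. It assumes an in-memory io.BytesIO object as a stand-in for the response returned by urlopen, and the sample HTML is made up for illustration. It parses the URL string itself and then a file-like object, and shows that only the second contains any real markup.

```python
import io
import warnings

from bs4 import BeautifulSoup

url = "http://www.espncricinfo.com/"

# Hypothetical stand-in for urlopen(url): any object with a .read() method works.
fake_response = io.BytesIO(b"<html><body><h1>Live scores</h1></body></html>")

with warnings.catch_warnings():
    # Silence the "looks like a URL" warning for this demonstration only.
    warnings.simplefilter("ignore")
    # The URL string is treated as the markup itself, not fetched.
    from_url_string = BeautifulSoup(url, "html.parser")

# A file-like object is read, and its contents are parsed as HTML.
from_file_like = BeautifulSoup(fake_response, "html.parser")

print(from_url_string.find("h1"))    # no <h1> exists in the URL string
print(from_file_like.h1.get_text())  # text of the <h1> in the sample page
```

In the question's code, web_page plays the role of fake_response here.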

Upvotes: 1

Youn Elan

Reputation: 2452

Try passing the HTML document, not the URL, to BeautifulSoup:

import urllib2
from bs4 import BeautifulSoup

url = "http://www.espncricinfo.com/"
web_page = urllib2.urlopen(url)

soup = BeautifulSoup(web_page , 'html.parser')
print (soup.prettify().encode('utf-8'))
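
Since the question asks for the visible text rather than the prettified markup, one possible follow-up is sketched below. The helper name visible_text, the list of tags treated as non-visible, and the sample HTML are all assumptions made for illustration. It is written for Python 3 and parses a string, so it runs without network access.

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Return the readable text of an HTML document (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose contents are not rendered as page text.
    # This tag list is an assumption and may need adjusting per site.
    for tag in soup(["script", "style", "head", "noscript", "template"]):
        tag.extract()
    # stripped_strings yields each remaining text node without
    # surrounding whitespace and skips whitespace-only nodes.
    return " ".join(soup.stripped_strings)


sample = """
<html><head><title>Demo</title><style>p { color: red; }</style></head>
<body><h1>Scores</h1><script>var x = 1;</script><p>India won by 5 runs.</p></body></html>
"""

print(visible_text(sample))
```

With a real page, the fetched document would be passed in place of sample. Note that urllib2 exists only on Python 2; on Python 3 the equivalent call is urllib.request.urlopen. Text hidden through CSS or generated by JavaScript is not handled by this sketch.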

Upvotes: 1
