Reputation: 65
I want to capture some of the text in an HTML page using Python. For example:
#!/usr/bin/python
import urllib
# Fetch the page; the names below avoid shadowing the built-in open()
response = urllib.urlopen('http://localhost/main.php')
html = response.read()
print html
And this is the source code of the target URL:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />
<title>Untitled Document</title>
</head>
<body>
This is body!
</body>
</html>
How can I capture only the words "This is body!"? Please help me with this!
As another example, suppose the HTML is replaced by this:
<table width=90% align=center>
<tr>
<td>The information available on this site freely accessible to the public</td>
</tr>
</table>
<table class=adminlist border=0 width=90% cellpadding=3 cellspacing=0 align=center>
<tr>
<td rowspan=5 colspan=2><img src=images/Forum.png><br></td>
</tr>
<tr>
<td><i><b>Phone</b></td><td>: +61 2 4446 5552</td>
</tr>
<tr>
<td><i><b>Name</b></td><td>: Student</td>
</tr>
<tr>
<td><i><b>Class</b></td>
<td>: Summer</td>
</tr>
<tr>
<td><i><b>Email</b></td>
<td>: [email protected]</td>
</tr>
</table>
And I want to produce this output:
Phone : +61 2 4446 5552
Name : Student
Class : Summer
Email : [email protected]
That is, capture only the core text of the HTML. :)
Upvotes: 3
Views: 371
Reputation: 35039
Try Beautiful Soup.
from BeautifulSoup import BeautifulSoup
...
soup = BeautifulSoup(html)
# findAll() returns a list of tags, so use find() to get the single <body> tag
soup.find("body").string
Upvotes: 6
Reputation: 5604
Use BeautifulSoup:
from BeautifulSoup import BeautifulSoup
html = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />
<title>Untitled Document</title>
</head>
<body>
This is body!
</body>
</html>
"""
soup = BeautifulSoup(html)
print soup.find('body').string
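For the table example in the question, `.string` will not help, because it is only set when a tag has a single child. Here is a minimal sketch for that case. It assumes Python 3 and uses only the standard library's `html.parser` instead of BeautifulSoup, so nothing needs installing. The names `CellCollector`, `extract_pairs` and `SAMPLE` are made up for illustration, and keeping only rows with exactly two cells is an assumption about which rows you want:

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new <tr> also ends any row that was never closed.
            self._close_row()
            self._row = []
        elif tag == "td":
            self._close_cell()
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td":
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        # Only keep text that appears inside a <td>.
        if self._cell is not None:
            self._cell.append(data)

    def _close_cell(self):
        if self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            text = " ".join("".join(self._cell).split())
            if self._row is not None:
                self._row.append(text)
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None


def extract_pairs(html):
    """Return 'label value' lines for rows that have exactly two cells."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return [" ".join(row) for row in parser.rows if len(row) == 2]


# Shortened copy of the HTML from the question.
SAMPLE = """
<table width=90% align=center>
<tr>
<td>The information available on this site freely accessible to the public</td>
</tr>
</table>
<table class=adminlist border=0 width=90% cellpadding=3 cellspacing=0 align=center>
<tr>
<td rowspan=5 colspan=2><img src=images/Forum.png><br></td>
</tr>
<tr>
<td><i><b>Phone</b></td><td>: +61 2 4446 5552</td>
</tr>
<tr>
<td><i><b>Name</b></td><td>: Student</td>
</tr>
<tr>
<td><i><b>Class</b></td>
<td>: Summer</td>
</tr>
</table>
"""

for line in extract_pairs(SAMPLE):
    print(line)
```

The one-cell rows (the notice and the image) are dropped by the two-cell filter. If your real page has other two-cell rows, you would need a stricter test, for example checking that the second cell starts with ":".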
Upvotes: 4