Reputation: 5769
I have the following code that tries to extract data from some HTML, but I can't get it to return what I need:
import urllib2
from bs4 import BeautifulSoup
from time import sleep

def getData():
    htmlfile = open('C:/html.html', 'rb')
    html = htmlfile.read()
    soup = BeautifulSoup(html)
    items = soup.find_all('div', class_="blocks")
    for item in items:
        links = item.find_all('h3')
        for link in links:
            print link

getData()
It prints a list of the following:
<h3>
<a href="http://www.mywebsite.com/titles" title="Click for details(x)">
TITLE STUFF HERE (YES)
</a>
</h3>
<h3>
<a href="http://www.mywebsite.com/titles" title="Click for details(x)">
TITLE STUFF HERE (MAYBE)
</a>
</h3>
I want to return just the titles: TITLE STUFF HERE (YES) and TITLE STUFF HERE (MAYBE).
Another thing I'd like to do is use something like the
soup.find_all("a", limit=2)
call, but instead of "limit" returning the first two results, I want it to return ONLY the second link. So a select feature rather than a limit. Does such a feature exist?
Upvotes: 2
Views: 2475
Reputation: 1787
import urllib2
from bs4 import BeautifulSoup
from time import sleep

def getData():
    htmlfile = open('C:/html.html', 'rb')
    html = htmlfile.read()
    soup = BeautifulSoup(html)
    items = soup.find_all('div', class_="blocks")
    for item in items:
        # Search for the links themselves rather than the <h3> wrappers
        links = item.find_all('a')
        for link in links:
            # Keep only links that sit directly inside an <h3>
            if link.parent.name == 'h3':
                print(link.text)

getData()
You can also find all the links from the very beginning and check that both the parent is an h3 and the parent's parent is a div with class blocks.
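Here is a rough sketch of that alternative, plus one possible way to grab only the second link. The names `SAMPLE_HTML`, `get_titles` and `get_second_link` are made up for illustration, and the sample markup is only modeled on the output in the question, so the real page may need adjusting. As far as I know `find_all` has no "select the Nth" argument; it returns a list-like result, so indexing it is the simple route. I pass `"html.parser"` explicitly to avoid depending on an optional parser, and `get_text(strip=True)` trims the whitespace around each title.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modeled on the question's output.
SAMPLE_HTML = """
<div class="blocks">
  <h3>
    <a href="http://www.mywebsite.com/titles" title="Click for details(x)">
      TITLE STUFF HERE (YES)
    </a>
  </h3>
  <h3>
    <a href="http://www.mywebsite.com/titles" title="Click for details(x)">
      TITLE STUFF HERE (MAYBE)
    </a>
  </h3>
</div>
<div class="other"><h3><a href="#">NOT WANTED</a></h3></div>
"""


def get_titles(html):
    """Return the text of every <a> whose parent is an <h3> and whose
    grandparent is a <div> carrying the class "blocks"."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for link in soup.find_all("a"):
        h3 = link.parent
        if h3 is None or h3.name != "h3":
            continue
        div = h3.parent
        # The class attribute is parsed as a list of class names.
        if div is not None and div.name == "div" and "blocks" in div.get("class", []):
            titles.append(link.get_text(strip=True))
    return titles


def get_second_link(html):
    """Return only the second <a> in the document, or None if there
    are fewer than two links."""
    links = BeautifulSoup(html, "html.parser").find_all("a", limit=2)
    return links[1] if len(links) > 1 else None


if __name__ == "__main__":
    for title in get_titles(SAMPLE_HTML):
        print(title)
    second = get_second_link(SAMPLE_HTML)
    if second is not None:
        print(second.get_text(strip=True))
```

Keeping `limit=2` just stops the search early; `soup.find_all("a")[1]` would give the same link as long as the page has at least two.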
Upvotes: 5