Ryflex

Reputation: 5769

Using BeautifulSoup in Python to get link names and "select" a specific link instead of limiting?

I have the following code, which tries to extract data from some HTML, but it does not return what I need:

from bs4 import BeautifulSoup

def getData():
    # Read the saved page from disk.
    with open('C:/html.html', 'rb') as htmlfile:
        html = htmlfile.read()
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('div', class_="blocks")
    for item in items:
        links = item.find_all('h3')
        for link in links:
            # Prints the whole <h3> tag, not just the title text.
            print(link)

getData()

This returns a list like the following:

<h3>
    <a href="http://www.mywebsite.com/titles" title="Click for details(x)">
    TITLE STUFF HERE (YES)
    </a>
</h3>

<h3>
    <a href="http://www.mywebsite.com/titles" title="Click for details(x)">
    TITLE STUFF HERE (MAYBE)
    </a>
</h3>

I want to return just the title text of each link: TITLE STUFF HERE (YES) and TITLE STUFF HERE (MAYBE).

I would also like to use something like soup.find_all("a", limit=2), but instead of limiting the results to the first two links, I want it to return ONLY the second link. In other words, a "select" feature rather than a limit. Does such a feature exist?

Upvotes: 2

Views: 2475

Answers (1)

prgao

Reputation: 1787

from bs4 import BeautifulSoup

def getData():
    # Read the saved page from disk.
    with open('C:/html.html', 'rb') as htmlfile:
        html = htmlfile.read()
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('div', class_="blocks")
    for item in items:
        links = item.find_all('a')
        for link in links:
            # Keep only links that sit directly inside an <h3>.
            if link.parent.name == 'h3':
                # get_text(strip=True) drops the surrounding whitespace and newlines.
                print(link.get_text(strip=True))

getData()

You can also find all the links from the very beginning and check that each link's parent is an h3 and that the parent's parent is a div with class blocks.
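A minimal sketch of that alternative, in case it helps. The sample markup is an assumption modelled on the output in the question (the wrapping div with class blocks and the extra non-title link are made up for illustration), and get_titles is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Assumed sample markup: the two <h3> blocks come from the question's output,
# the <div class="blocks"> wrapper and the <p> link are illustrative only.
SAMPLE_HTML = """
<div class="blocks">
  <h3>
    <a href="http://www.mywebsite.com/titles" title="Click for details(x)">
    TITLE STUFF HERE (YES)
    </a>
  </h3>
  <h3>
    <a href="http://www.mywebsite.com/titles" title="Click for details(x)">
    TITLE STUFF HERE (MAYBE)
    </a>
  </h3>
  <p><a href="http://www.mywebsite.com/other">NOT A TITLE</a></p>
</div>
"""


def get_titles(html):
    """Return the text of each <a> whose parent is an <h3> whose parent is div.blocks."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for link in soup.find_all("a"):
        h3 = link.parent
        if h3 is None or h3.name != "h3":
            continue
        div = h3.parent
        # "class" is a multi-valued attribute, so bs4 exposes it as a list.
        if div is not None and div.name == "div" and "blocks" in div.get("class", []):
            titles.append(link.get_text(strip=True))
    return titles


titles = get_titles(SAMPLE_HTML)
print(titles)      # all matching titles
print(titles[1])   # only the second one
```

As for returning only the second link: find_all returns a list-like result, so plain indexing such as soup.find_all("a")[1] (or titles[1] above) should do it, provided at least two matches exist. A CSS selector like soup.select("div.blocks > h3 > a") would likely express the same parent checks more compactly.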

Upvotes: 5
