Lzypenguin

Reputation: 955

How to grab specifically what I need using BeautifulSoup

I am scraping a website and pulling info from multiple spots on the page. The HTML looks like this:

<div class="Item-Details">
    <p class="Product-title">
        <a href="/link_i_need">
            text here that i need to grab
            more text here that i would like to grab
        </a>
    </p>

I am using this:

soup = BeautifulSoup(html, 'lxml')
mydivs = soup.findAll("p", {"class": "product-title"})
for div in mydivs:
    print(div)

But it returns this:

<p class="product-title">
<a href="/info">line 1 description as well as line 2 description with no break</a>
</p>

I need to return the href portion in quotes, as well as both lines separately. I have tried using these but neither works:

print(div.get('href'))
print(div.find('a'))

Any help is appreciated.

Upvotes: 1

Views: 31

Answers (2)

baduker

Reputation: 20042

Well, first of all, you're missing the closing tag </div>. Then, you have a typo: it's "Product-title", not "product-title", and BeautifulSoup compares class names case-sensitively. Finally, printing each matched tag as a whole doesn't get you any closer to your desired output.
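
To illustrate the typo point, here is a minimal sketch. It assumes a trimmed-down version of the markup from the question and Python's built-in html.parser; the variable names are made up for the example. The class filter only matches the exact spelling:

```python
from bs4 import BeautifulSoup

# Trimmed-down, hypothetical version of the markup from the question.
snippet = '<p class="Product-title"><a href="/link_i_need">text</a></p>'
soup = BeautifulSoup(snippet, "html.parser")

# Lowercase spelling: nothing matches, so a loop over the result has nothing to iterate.
lowercase_hits = soup.find_all("p", {"class": "product-title"})
# Exact spelling: the paragraph is found.
exact_hits = soup.find_all("p", {"class": "Product-title"})

print(len(lowercase_hits), len(exact_hits))
```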

So, assuming your HTML looks like this:

sample = """
<div class="Item-Details">
    <p class="Product-title">
        <a href="/link_i_need">
            text here that i need to grab
            more text here that i would like to grab
        </a>
    </p>
</div>
"""

You could try this:

from bs4 import BeautifulSoup

# sample is the HTML string defined above
paragraphs = BeautifulSoup(sample, "html.parser").find_all("p", {"class": "Product-title"})
for p in paragraphs:
    link = p.find("a")
    print(f"{link.get('href')}\n{link.get_text(strip=True)}")

To get this:

/link_i_need
text here that i need to grab
            more text here that i would like to grab
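
If the two lines are needed as separate values, one option is to split the link text on line breaks. The following is a rough sketch, not a drop-in solution: extract_links is a hypothetical helper name, and it assumes the text inside the a tag really contains a newline, as in sample above. If the real page delivers both descriptions on a single line, as the output in the question suggests, there is nothing to split on.

```python
from bs4 import BeautifulSoup

sample = """
<div class="Item-Details">
    <p class="Product-title">
        <a href="/link_i_need">
            text here that i need to grab
            more text here that i would like to grab
        </a>
    </p>
</div>
"""

def extract_links(html):
    """Return a list of (href, [text lines]) pairs, one per matching <p>."""
    results = []
    soup = BeautifulSoup(html, "html.parser")
    for p in soup.find_all("p", {"class": "Product-title"}):
        link = p.find("a")
        if link is None:  # skip paragraphs without a link
            continue
        # Split on line breaks and drop empty or whitespace-only lines.
        lines = [ln.strip() for ln in link.get_text().splitlines() if ln.strip()]
        results.append((link.get("href"), lines))
    return results

for href, lines in extract_links(sample):
    print(href)
    for line in lines:
        print(line)
```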

Upvotes: 1

Axiumin_

Reputation: 2145

After getting the p tag (called div in your loop), you can get the href attribute of the a tag inside it with div.find("a")['href']. So for your code, it'd look like this:

from bs4 import BeautifulSoup

# html is the page source you already fetched
soup = BeautifulSoup(html, 'lxml')
mydivs = soup.find_all("p", {"class": "product-title"})
for div in mydivs:
    print(div.find("a")['href'])

Note that this raises a KeyError if an a tag has no href attribute, and a TypeError if a p tag contains no a tag at all, because find then returns None.
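
To avoid that failure, one possible approach is to check for a missing tag and use .get("href"), which returns None instead of raising. This is only a sketch: safe_hrefs and the demo markup are invented for illustration, and it uses the built-in html.parser so it runs without lxml installed.

```python
from bs4 import BeautifulSoup

def safe_hrefs(html, css_class="product-title"):
    """Collect href values, skipping paragraphs with no link or no href."""
    soup = BeautifulSoup(html, "html.parser")
    hrefs = []
    for p in soup.find_all("p", {"class": css_class}):
        link = p.find("a")
        if link is None:          # the <p> has no <a> at all
            continue
        href = link.get("href")   # None when the attribute is missing
        if href is not None:
            hrefs.append(href)
    return hrefs

# Hypothetical markup covering the three cases.
demo = """
<p class="product-title"><a href="/info">has href</a></p>
<p class="product-title"><a>no href</a></p>
<p class="product-title">no link at all</p>
"""
print(safe_hrefs(demo))
```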

For the text inside, you can use the .text property, like this:

from bs4 import BeautifulSoup

# html is the page source you already fetched
soup = BeautifulSoup(html, 'lxml')
mydivs = soup.find_all("p", {"class": "product-title"})
for div in mydivs:
    print(div.find("a").text)
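
Keep in mind that .text returns the text exactly as it appears in the markup, including line breaks and indentation. Here is a small sketch, assuming a made-up snippet shaped like the one in the question, of collapsing that whitespace with plain string methods:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page may be formatted differently.
html = """<p class="product-title">
<a href="/info">
    line 1 description
    line 2 description
</a>
</p>"""

link = BeautifulSoup(html, "html.parser").find("p", {"class": "product-title"}).find("a")

raw = link.text                   # keeps the newlines and leading spaces
collapsed = " ".join(raw.split()) # one line, single spaces between words

print(repr(raw))
print(collapsed)
```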

Upvotes: 1
