Jed Christiansen

Reputation: 669

BeautifulSoup - get attributes on the divs I'm iterating over

I'm using BeautifulSoup to parse lists of companies from VC websites. I've found the right elements to iterate over, but I can't seem to get data on those elements themselves.

Here's the sample HTML I'm going through:

<div id="content" class="site-content">
    <main id="primary" class="content-area" role="main">
        <header class="page-header">
        <main id="portfolio-landing-company-list" class="page-content">
            <section id="portfolio__list--grid" class="portfolio__list--all">
            <div class="company company-stage--venturegrowth company-type--enterprise company--single-company">
                    <div class="company__thumbnail company__thumbnail-link">
                        <a href="http://www.domain1.com" title="Company1" target="_blank">
                    </div>      
            </div>
            <div class="company company-stage--seed company-type--bio company--single-company">
                    <div class="company__thumbnail company__thumbnail-link">
                        <a href="http://www.domain2.com" title="Company2" target="_blank">
                    </div>
            </div>

This is how I'm currently using BeautifulSoup and this part is working great:

portfolio = soup.find('div', attrs={'class': 'portfolio-tiles'})
for eachco in portfolio.find_all('article'):
  companyname = eachco.a['title']
  companyurl = eachco.a['href']

But what I want to do is grab the class values from here:

<div class="company company-stage--venturegrowth company-type--enterprise company--single-company">
or
<div class="company company-stage--seed company-type--bio company--single-company">

(there are multiple variations for each company in the list)

I've tried iterating through with:

portfolio = soup.find('div', attrs={'class': 'portfolio-tiles'})
for eachco in portfolio.find_all('article'):
  companyattributes = eachco.div['class']

but that spits out rows of:

['company__thumbnail', 'company__thumbnail-link']

(i.e., one level below what I'm looking for)

How can I iterate over all of the results but get the class values of each result itself? I sense I'm missing something really basic, but would appreciate any help figuring out what that thing is!

UPDATE

I ended up going with the following, which got everything working together:

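portfolio = soup.find_all('div', class_=re.compile("company company-"))  # needs `import re`
for eachco in portfolio:
    coname = eachco.a['title']
    courl = eachco.a['href']
    cotypes = eachco['class']   # classes of the company div itself
    costage = cotypes[1]        # e.g. 'company-stage--seed'
    comarket = cotypes[2]       # e.g. 'company-type--bio'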

Upvotes: 1

Views: 125

Answers (2)

Jack Fleeting

Reputation: 24940

I think this is what you're looking for:

# Loop over the matching divs directly; len(soup) is not the number of matches
for div in soup.select('div[class*="stage"]'):
     print(div.attrs['class'])

Output

   ['company', 'company-stage--venturegrowth', 'company-type--enterprise', 'company--single-company']
   ['company', 'company-stage--seed', 'company-type--bio', 'company--single-company']
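
For what it's worth, the original attempt printed the inner classes because `eachco.div` means "the first <div> inside eachco", while `eachco['class']` reads the attributes of eachco itself. Below is a minimal sketch of that difference. It assumes a trimmed copy of the question's markup (the closing </a> tags are added here), and it uses the `div.company` selector rather than the substring selector above, so treat it as an illustration rather than a drop-in replacement.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the question's markup (closing </a> tag added here).
html = """
<section id="portfolio__list--grid">
  <div class="company company-stage--seed company-type--bio company--single-company">
    <div class="company__thumbnail company__thumbnail-link">
      <a href="http://www.domain2.com" title="Company2" target="_blank"></a>
    </div>
  </div>
</section>
"""

soup = BeautifulSoup(html, "html.parser")

# "div.company" matches only divs that carry the exact class token "company",
# so the inner "company__thumbnail" div is not selected.
outer = soup.select_one("div.company")

own_classes = outer["class"]        # attributes of the matched tag itself
child_classes = outer.div["class"]  # attributes of its first descendant <div>

print(own_classes)
print(child_classes)
```

The first list holds the four classes of the company div and the second holds the two thumbnail classes, which is what the question was seeing.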

Upvotes: 1

KunduK

Reputation: 33384

You can use the re module to match part of a class name.

from bs4 import BeautifulSoup
import re
html = """<html><div id="content" class="site-content">
    <main id="primary" class="content-area" role="main">
        <header class="page-header">
        <main id="portfolio-landing-company-list" class="page-content">
            <section id="portfolio__list--grid" class="portfolio__list--all">
            <div class="company company-stage--venturegrowth company-type--enterprise company--single-company">
                    <div class="company__thumbnail company__thumbnail-link">
                        <a href="http://www.domain1.com" title="Company1" target="_blank">
                    </div>
            </div>
            <div class="company company-stage--venturegrowth company-type--enterprise company--single-company">
                    <div class="company__thumbnail company__thumbnail-link">
                        <a href="http://www.domain2.com" title="Company2" target="_blank">
                    </div>
            </div> </html>"""

soup = BeautifulSoup(html, 'html.parser')
divs = soup.find_all('div', class_=re.compile("stage"))
for div in divs:
    print(div['class'])

Output :

[u'company', u'company-stage--venturegrowth', u'company-type--enterprise', u'company--single-company']
[u'company', u'company-stage--venturegrowth', u'company-type--enterprise', u'company--single-company']
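
If the goal is to pull the stage and market out of that class list, looking them up by prefix may be a bit sturdier than relying on their position in the list. This is only a rough sketch: `class_suffix` is a hypothetical helper, the `company-stage--` and `company-type--` prefixes are taken from the sample HTML, and the second div below deliberately swaps the class order (an assumption for illustration, not something seen on the real page).

```python
from bs4 import BeautifulSoup

# Sample markup based on the question; closing </a> tags added here, and the
# second div has its stage/type classes swapped on purpose (illustrative only).
html = """
<div class="company company-stage--venturegrowth company-type--enterprise company--single-company">
  <div class="company__thumbnail company__thumbnail-link">
    <a href="http://www.domain1.com" title="Company1" target="_blank"></a>
  </div>
</div>
<div class="company company-type--bio company-stage--seed company--single-company">
  <div class="company__thumbnail company__thumbnail-link">
    <a href="http://www.domain2.com" title="Company2" target="_blank"></a>
  </div>
</div>
"""


def class_suffix(classes, prefix):
    """Return the text after `prefix` for the first class starting with it, else None."""
    for name in classes:
        if name.startswith(prefix):
            return name[len(prefix):]
    return None


soup = BeautifulSoup(html, "html.parser")

companies = []
for div in soup.select("div.company"):
    classes = div["class"]  # classes of the company div itself
    companies.append({
        "name": div.a["title"],
        "url": div.a["href"],
        "stage": class_suffix(classes, "company-stage--"),
        "market": class_suffix(classes, "company-type--"),
    })

print(companies)
```

A missing prefix comes back as None instead of raising an IndexError, which could help if some companies on the page lack a stage or type class.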

Upvotes: 1
