Reputation: 669
I'm using BeautifulSoup to parse lists of companies from VC websites. I've found the right elements to iterate over, but I can't seem to get data on those elements themselves.
Here's the sample HTML I'm going through:
<div id="content" class="site-content">
<main id="primary" class="content-area" role="main">
<header class="page-header">
<main id="portfolio-landing-company-list" class="page-content">
<section id="portfolio__list--grid" class="portfolio__list--all">
<div class="company company-stage--venturegrowth company-type--enterprise company--single-company">
<div class="company__thumbnail company__thumbnail-link">
<a href="http://www.domain1.com" title="Company1" target="_blank">
</div>
</div>
<div class="company company-stage--seed company-type--bio company--single-company">
<div class="company__thumbnail company__thumbnail-link">
<a href="http://www.domain2.com" title="Company2" target="_blank">
</div>
</div>
This is how I'm currently using BeautifulSoup and this part is working great:
portfolio = soup.find('div', attrs={'class': 'portfolio-tiles'})
for eachco in portfolio.find_all('article'):
    companyname = eachco.a['title']
    companyurl = eachco.a['href']
But what I want to do is grab the class values from the company element itself, for example:
<div class="company company-stage--venturegrowth company-type--enterprise company--single-company">
or
<div class="company company-stage--seed company-type--bio company--single-company">
(there are multiple variations for each company in the list)
I've tried iterating through with:
portfolio = soup.find('div', attrs={'class': 'portfolio-tiles'})
for eachco in portfolio.find_all('article'):
    companyattributes = eachco.div['class']
but that spits out rows of:
['company__thumbnail', 'company__thumbnail-link']
(i.e., one level below what I'm looking for)
How can I iterate over all of the results and get the class values of each result itself, rather than of its child? I sense I'm missing something really basic, but would appreciate any help figuring out what that thing is!
UPDATE
I ended up going with the following, which got everything working together:
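import re

# Match the outer company divs by their class string
portfolio = soup.find_all('div', class_=re.compile("company company-"))
for eachco in portfolio:
    coname = eachco.a['title']
    courl = eachco.a['href']
    cotypes = eachco['class']   # list of class names on the div itself
    costage = cotypes[1]        # e.g. 'company-stage--seed'
    comarket = cotypes[2]       # e.g. 'company-type--bio'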
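The positional lookups above (`cotypes[1]`, `cotypes[2]`) depend on the class order staying the same on every company. As a hedged alternative, here is a minimal sketch that looks the values up by prefix instead. The helper `class_suffix` and the `companies` list are names made up for illustration, and the HTML is a trimmed copy of the sample above with the `<a>` tags closed.

```python
from bs4 import BeautifulSoup

html = """
<div class="company company-stage--venturegrowth company-type--enterprise company--single-company">
  <div class="company__thumbnail company__thumbnail-link">
    <a href="http://www.domain1.com" title="Company1" target="_blank"></a>
  </div>
</div>
<div class="company company-stage--seed company-type--bio company--single-company">
  <div class="company__thumbnail company__thumbnail-link">
    <a href="http://www.domain2.com" title="Company2" target="_blank"></a>
  </div>
</div>
"""


def class_suffix(classes, prefix):
    """Hypothetical helper: return what follows `prefix` in the first matching class, else None."""
    for name in classes:
        if name.startswith(prefix):
            return name[len(prefix):]
    return None


soup = BeautifulSoup(html, "html.parser")

companies = []
# "div.company" matches only divs that carry the exact class token "company",
# so the inner "company__thumbnail" divs are skipped.
for co in soup.select("div.company"):
    classes = co["class"]  # classes of the company div itself
    companies.append({
        "name": co.a["title"],
        "url": co.a["href"],
        "stage": class_suffix(classes, "company-stage--"),
        "market": class_suffix(classes, "company-type--"),
    })

for row in companies:
    print(row)
```

If a site omits one of the classes, the corresponding field comes back as `None` instead of silently picking up the wrong class.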
Upvotes: 1
Views: 125
Reputation: 24940
I think this is what you're looking for:
for div in soup.select('div[class*="stage"]'):
    print(div.attrs['class'])
Output
['company', 'company-stage--venturegrowth', 'company-type--enterprise', 'company--single-company']
['company', 'company-stage--seed', 'company-type--bio', 'company--single-company']
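As for why the original attempt printed the inner classes: `tag.div` is shorthand for `tag.find('div')`, which searches the tag's descendants, so it never returns the tag itself. A small sketch of the difference, assuming `eachco` in the question was already the company div (the variable names `outer`, `own_classes` and `child_classes` are just for illustration):

```python
from bs4 import BeautifulSoup

html = """<div class="company company-stage--seed company-type--bio company--single-company">
  <div class="company__thumbnail company__thumbnail-link">
    <a href="http://www.domain2.com" title="Company2" target="_blank"></a>
  </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# The outer company div (first div whose class attribute contains "stage")
outer = soup.select_one('div[class*="stage"]')

own_classes = outer["class"]        # classes on the company div itself
child_classes = outer.div["class"]  # classes on its first descendant div

print(own_classes)
print(child_classes)
```

So indexing the tag directly (`outer["class"]`) gives the level the question is after, while `.div` steps down one level.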
Upvotes: 1
Reputation: 33384
You can use the re module to match particular text in the class attribute.
from bs4 import BeautifulSoup
import re
html = """<html><div id="content" class="site-content">
<main id="primary" class="content-area" role="main">
<header class="page-header">
<main id="portfolio-landing-company-list" class="page-content">
<section id="portfolio__list--grid" class="portfolio__list--all">
<div class="company company-stage--venturegrowth company-type--enterprise company--single-company">
<div class="company__thumbnail company__thumbnail-link">
<a href="http://www.domain1.com" title="Company1" target="_blank">
</div>
</div>
<div class="company company-stage--venturegrowth company-type--enterprise company--single-company">
<div class="company__thumbnail company__thumbnail-link">
<a href="http://www.domain2.com" title="Company2" target="_blank">
</div>
</div> </html>"""
soup = BeautifulSoup(html, 'html.parser')
divs = soup.find_all('div', class_=re.compile("stage"))
for div in divs:
    print(div['class'])
Output:
[u'company', u'company-stage--venturegrowth', u'company-type--enterprise', u'company--single-company']
[u'company', u'company-stage--venturegrowth', u'company-type--enterprise', u'company--single-company']
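One caveat with a bare `"stage"` pattern is that it matches any class containing that substring. A hedged variant, assuming the classes of interest always start with `company-stage--`, anchors the pattern to that prefix. The `backstage-banner` div below is a made-up decoy added only to show the difference.

```python
import re
from bs4 import BeautifulSoup

html = """
<div class="backstage-banner">not a company</div>
<div class="company company-stage--venturegrowth company-type--enterprise company--single-company"></div>
<div class="company company-stage--seed company-type--bio company--single-company"></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Loose pattern: also picks up the decoy, because "backstage-banner" contains "stage"
loose = soup.find_all("div", class_=re.compile("stage"))

# Anchored pattern: BeautifulSoup tests the regex against each class name,
# so "^" pins the match to the start of a single class
strict = soup.find_all("div", class_=re.compile(r"^company-stage--"))

for div in strict:
    print(div["class"])
```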
Upvotes: 1