Reputation: 1659
I need to scrape small-business information from a public site.
This is the HTML format:
<div class="listings">
<ul>
<li>
<h3>Machine Machine Company Inc</h3>
</li>
<li><a href="#government_funding" style="font-size:.8em;">View funding actions</a></li>
<li>Alexandria, AL 36250</li>
<li><span style="font-weight: bold;">Phone</span>: 256-830-3440</li>
<li><span style="font-weight: bold;">Estimated Number of Employees</span>: 64</li>
<li><span style="font-weight: bold;">Estimated Annual Receipts</span>: $9,691,383</li>
<li><span style="font-weight: bold;">Business Start Date</span>: 1971</li>
</ul>
<ul>
<li><span style="font-weight: bold;">Contact Person</span>: James HOland</li>
<li><span style="font-weight: bold;">Contact Phone</span>: 256-820-3440</li>
<li><span style="font-weight: bold;">Contact Email</span>: [email protected]</li>
</ul>
<ul>
<li><span style="font-weight: bold;">Business Structure</span>:</li>
<li>Corporate Entity (Not Tax Exempt)</li>
</ul>
<ul>
<li><span style="font-weight: bold;">Business Type</span>:</li>
<li>For Profit Organization</li>
<li>Manufacturer of Goods</li>
</ul>
<ul>
<li><span style="font-weight: bold;">Industries Served</span>: All Other Miscellaneous Fabricated Metal Product Manufacturing, All Other Miscellaneous General Purpose Machinery Manufacturing</li>
</ul>
<div style="padding-top: 10px;" id="government_funding">
<h2>Sampling of Recent Funding Actions/Set Asides</h2>
<p style="font-style: italic; font-size: .8em;">In order by amount of set aside monies.</p>
<ul>
<li><span style="color: green;">$500,000</span> - Tuesday the 29th of November 2016<br><span style="font-weight: bold; font-size: 1.2em;">Department Of Army</span> <br> W0LX ANNISTON DEPOT PROP DIV<br>IGF::CT::IGF. INCREASE FUNDING FOR THE ABRASIVE CLEAN OF VARIOUS PARTS
<hr>
</li>
</ul>
</div>
</div>
My plan for extracting the data is to collect all the "ul" tags into a container, then iterate through them and, based on the index number, find the desired text (e.g. the email address). I have this Python script that attempts to retrieve the email address:
companydriver.get(weburl)
businessesoup = BeautifulSoup(companydriver.page_source, "html5lib")
# GET BUSINESS DATA
businesscontainer = businessesoup.find_all("ul")
dataresult = [c for c in businesscontainer]
print(colorama.Fore.BLUE + str(dataresult))
for idx, datacell in enumerate(dataresult, start=0):
    # arraylenght = dataresult.lenght
    # print("this is dataresult", dataresult)
    print("Index ", str(idx))
    print(colorama.Fore.RED + 'This is data cell', str(datacell))
    print(" ")
    if (idx == 1):
        emailaddress = dataresult.find("span").text
        print(colorama.Fore.GREEN + str(emailaddress))
The problem is I can't seem to get the email address.
I need to extract these items:
How could I easily extract the email address and the rest?
Upvotes: 0
Views: 75
Reputation: 1605
You can use a regular expression to find the strings you are looking for, then get the parent of each matching object:
Edit
Explanation: Passing text=re.compile(...) to find_all lets us apply a regular expression to the text of the tags in our Beautiful Soup object. In this case we are interested in span tags. Since we know the label text in the HTML, we can combine several labels in one pattern. The ^ operator anchors the match to the start of the string, and () defines a subexpression, or matching group. I wrote each of your criteria as a matching group and joined them with the | (bar) symbol, which acts as a logical OR.
from bs4 import BeautifulSoup
import re
html = """
<div class="listings">
<ul>
<li>
<h3>Machine Machine Company Inc</h3>
</li>
<li><a href="#government_funding" style="font-size:.8em;">View funding actions</a></li>
<li>Alexandria, AL 36250</li>
<li><span style="font-weight: bold;">Phone</span>: 256-830-3440</li>
<li><span style="font-weight: bold;">Estimated Number of Employees</span>: 64</li>
<li><span style="font-weight: bold;">Estimated Annual Receipts</span>: $9,691,383</li>
<li><span style="font-weight: bold;">Business Start Date</span>: 1971</li>
</ul>
<ul>
<li><span style="font-weight: bold;">Contact Person</span>: James HOland</li>
<li><span style="font-weight: bold;">Contact Phone</span>: 256-820-3440</li>
<li><span style="font-weight: bold;">Contact Email</span>: [email protected]</li>
</ul>
<ul>
<li><span style="font-weight: bold;">Business Structure</span>:</li>
<li>Corporate Entity (Not Tax Exempt)</li>
</ul>
<ul>
<li><span style="font-weight: bold;">Business Type</span>:</li>
<li>For Profit Organization</li>
<li>Manufacturer of Goods</li>
</ul>
<ul>
<li><span style="font-weight: bold;">Industries Served</span>: All Other Miscellaneous Fabricated Metal Product Manufacturing, All Other Miscellaneous General Purpose Machinery Manufacturing</li>
</ul>
<div style="padding-top: 10px;" id="government_funding">
<h2>Sampling of Recent Funding Actions/Set Asides</h2>
<p style="font-style: italic; font-size: .8em;">In order by amount of set aside monies.</p>
<ul>
<li><span style="color: green;">$500,000</span> - Tuesday the 29th of November 2016<br><span style="font-weight: bold; font-size: 1.2em;">Department Of Army</span> <br> W0LX ANNISTON DEPOT PROP DIV<br>IGF::CT::IGF. INCREASE FUNDING FOR THE ABRASIVE CLEAN OF VARIOUS PARTS
<hr>
</li>
</ul>
</div>
</div>
"""
bs = BeautifulSoup(html, 'lxml')
# Find every <span> whose text starts with one of the labels, then print the text of its parent <li>
for li in bs.find_all('span', text=re.compile('^(Contact Email)|^(Business Type)|^(Phone)|^(Estimated Number of Employees)|^(Estimated Annual Receipts)|^(Contact Person)|^(Industries Served)|^(Department Of Army)')):
    print(li.parent.text)
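To see what the ^ anchor and the | alternation do in practice, here is a minimal sketch on a shortened, hypothetical copy of the listing. The email address is a placeholder, the variable names are mine, and I use the built-in "html.parser" and the string= argument (the newer name for text=) so that it runs without lxml. Because of the anchor, "Phone" is matched but "Contact Phone" is not.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, shortened sample of the listing; the email is a placeholder.
sample_html = """
<ul>
<li><span style="font-weight: bold;">Phone</span>: 256-830-3440</li>
<li><span style="font-weight: bold;">Contact Phone</span>: 256-820-3440</li>
<li><span style="font-weight: bold;">Contact Email</span>: contact@example.com</li>
</ul>
"""

# A single group holding all alternatives; ^ anchors each one to the start
# of the span text, so "Contact Phone" does not match the "Phone" alternative.
label_pattern = re.compile(r"^(Phone|Contact Email)")

soup = BeautifulSoup(sample_html, "html.parser")
matches = [
    span.parent.get_text()  # the parent of each matching <span> is its <li>
    for span in soup.find_all("span", string=label_pattern)
]
print(matches)
```

Putting the alternatives inside one group keeps the pattern shorter than repeating ^ in front of every label.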
Upvotes: 0
Reputation: 313
You can try passing the text directly as the find_all argument; see the Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Example:
strings_to_search_for = ["Phone", "Estimated Number of Employees"]
businesscontainer = businessesoup.find_all(string=strings_to_search_for)
for element in businesscontainer:
    # element is the matched text inside the <span>, so go up to the enclosing <li>
    value = element.find_parent("li").text  # get <li> value
    # do something ...
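A self-contained sketch of the same idea, assuming a shortened, hypothetical copy of the listing; the variable names and the use of the built-in "html.parser" are my own choices. find_all(string=[...]) returns the matching text nodes themselves, so the sketch uses find_parent("li") to reach the list item that holds the value.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample of the listing.
sample_html = """
<ul>
<li><span style="font-weight: bold;">Phone</span>: 256-830-3440</li>
<li><span style="font-weight: bold;">Estimated Number of Employees</span>: 64</li>
<li><span style="font-weight: bold;">Business Start Date</span>: 1971</li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")
labels = ["Phone", "Estimated Number of Employees"]

found = {}
# A list passed to string= matches text nodes equal to any of its items.
for label in soup.find_all(string=labels):
    item = label.find_parent("li")  # nearest enclosing <li>
    found[str(label)] = item.get_text()
print(found)
```

Note that the list form only matches exact strings, so a label with extra whitespace around it would be missed.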
Hope it helps.
Upvotes: 1
Reputation: 1165
What you are doing now will not work, because you are extracting text from a <span> element, while the information you are after is in the <li> element that contains that <span>. I would suggest you do the following:
For each <li> element:
1. Check whether it contains a <span> element and, if so, what the text of that element is.
2. If you find a <span> element with, for example, the text "Contact Email", you know that the <li> element contains the information you need.
3. Once you have the <li> element with the information you need, you can extract its text contents. This will probably also contain the (for example) "Contact Email" text, so you will need to do some post-processing, but that is not the hardest part of this whole quest.
EDIT: Code
Based on your code, you would probably do something like the following to extract the email address (note: not guaranteed to work, but that is not the point):
soup = BeautifulSoup(...)
for li in soup.find_all("li"):
    span = li.find("span")
    if span is None:
        continue
    if span.get_text() == "Contact Email":
        print("Found email: " + str(li.get_text()))
        # Now all you have to do is extract the address from the text of the <li> tag
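The post-processing step could look roughly like this. It is only a sketch under a few assumptions: the sample HTML is a shortened, hypothetical copy of the listing with a placeholder email, labelled_fields is a helper name I made up, and the value is taken to be everything after the first colon in the <li> text.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample of the listing; the email is a placeholder.
sample_html = """
<ul>
<li><span style="font-weight: bold;">Contact Person</span>: James HOland</li>
<li><span style="font-weight: bold;">Contact Email</span>: contact@example.com</li>
</ul>
<ul>
<li><span style="font-weight: bold;">Business Structure</span>:</li>
<li>Corporate Entity (Not Tax Exempt)</li>
</ul>
"""


def labelled_fields(soup):
    """Map each <span> label to the text that follows the first colon in its <li>."""
    fields = {}
    for li in soup.find_all("li"):
        span = li.find("span")
        if span is None:
            continue  # plain <li> without a label
        label = span.get_text(strip=True)
        # partition splits on the first ":" only, so colons in the value survive
        _, _, value = li.get_text().partition(":")
        fields[label] = value.strip()
    return fields


fields = labelled_fields(BeautifulSoup(sample_html, "html.parser"))
print(fields["Contact Email"])
```

Fields such as "Business Structure", whose value sits in the following <li> rather than after the colon, come back as an empty string here and would need separate handling.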
Upvotes: 1