e.iluf

Reputation: 1659

Could not successfully extract text from site html

I need to scrape small business information from a public site

This is the html format

<div class="listings">
    <ul>
        <li>
            <h3>Machine Machine Company Inc</h3>
        </li>
        <li><a href="#government_funding" style="font-size:.8em;">View funding actions</a></li>
        <li>Alexandria, AL 36250</li>
        <li><span style="font-weight: bold;">Phone</span>: 256-830-3440</li>
        <li><span style="font-weight: bold;">Estimated Number of Employees</span>: 64</li>
        <li><span style="font-weight: bold;">Estimated Annual Receipts</span>: $9,691,383</li>
        <li><span style="font-weight: bold;">Business Start Date</span>: 1971</li>
    </ul>
    <ul>
        <li><span style="font-weight: bold;">Contact Person</span>: James HOland</li>
        <li><span style="font-weight: bold;">Contact Phone</span>: 256-820-3440</li>
        <li><span style="font-weight: bold;">Contact Email</span>: [email protected]</li>
    </ul>
    <ul>
        <li><span style="font-weight: bold;">Business Structure</span>:</li>
        <li>Corporate Entity (Not Tax Exempt)</li>
    </ul>
    <ul>
        <li><span style="font-weight: bold;">Business Type</span>:</li>
        <li>For Profit Organization</li>
        <li>Manufacturer of Goods</li>
    </ul>
    <ul>
        <li><span style="font-weight: bold;">Industries Served</span>: All Other Miscellaneous Fabricated Metal Product Manufacturing, All Other Miscellaneous General Purpose Machinery Manufacturing</li>
    </ul>
    <div style="padding-top: 10px;" id="government_funding">
        <h2>Sampling of Recent  Funding Actions/Set Asides</h2>
        <p style="font-style: italic; font-size: .8em;">In order by amount of set aside monies.</p>
        <ul>
            <li><span style="color: green;">$500,000</span> - Tuesday the 29th of November 2016<br><span style="font-weight: bold; font-size: 1.2em;">Department Of Army</span> <br> W0LX ANNISTON DEPOT PROP DIV<br>IGF::CT::IGF. INCREASE FUNDING FOR THE ABRASIVE CLEAN OF VARIOUS PARTS
                <hr>
            </li>
        </ul>
    </div>
</div>

My plan is to collect all the "ul" tags into a container, then iterate through them and use the index number to find the desired text (e.g. the email). This is the Python script I wrote to try to retrieve the email address:

companydriver.get(weburl)

businessesoup = BeautifulSoup(companydriver.page_source,"html5lib");

#GET BUSINESS DATA
businesscontainer = businessesoup.find_all("ul")

dataresult = [c for c in businesscontainer]

print(colorama.Fore.BLUE +  str(dataresult))

for idx, datacell in enumerate(dataresult, start=0):
    # arraylenght = dataresult.lenght
    # print("this is dataresult", dataresult)
    print("Index ", str(idx))
    print(colorama.Fore.RED +'This is data cell',str(datacell))
    print(" ")

    if (idx == 1)  :
        emailaddress = dataresult.find("span").text
        print(colorama.Fore.GREEN + str(emailaddress))

The problem is I can't seem to get the email address.

I need to extract these items:

How could I easily extract the email address and the rest?

Upvotes: 0

Views: 75

Answers (3)

Ian-Fogelman

Reputation: 1605

You can use a regular expression to find the string you are looking for, then get the parent of that object:

Edit

Explanation: Passing text=re.compile(...) to find_all lets us apply a regular expression to the text values in our BeautifulSoup object. In this case we are interested in span tags. Since we know the label text in the HTML, we can combine several alternatives in one pattern. The ^ operator anchors a match to the start of the string, and () marks a subexpression, or matching group. So I wrote each of your criteria as a matching group and joined them with the | (bar) symbol, which acts as a logical OR.

http://rextester.com/KBB57950

from bs4 import BeautifulSoup
import re

html = """
<div class="listings">
    <ul>
        <li>
            <h3>Machine Machine Company Inc</h3>
        </li>
        <li><a href="#government_funding" style="font-size:.8em;">View funding actions</a></li>
        <li>Alexandria, AL 36250</li>
        <li><span style="font-weight: bold;">Phone</span>: 256-830-3440</li>
        <li><span style="font-weight: bold;">Estimated Number of Employees</span>: 64</li>
        <li><span style="font-weight: bold;">Estimated Annual Receipts</span>: $9,691,383</li>
        <li><span style="font-weight: bold;">Business Start Date</span>: 1971</li>
    </ul>
    <ul>
        <li><span style="font-weight: bold;">Contact Person</span>: James HOland</li>
        <li><span style="font-weight: bold;">Contact Phone</span>: 256-820-3440</li>
        <li><span style="font-weight: bold;">Contact Email</span>: [email protected]</li>
    </ul>
    <ul>
        <li><span style="font-weight: bold;">Business Structure</span>:</li>
        <li>Corporate Entity (Not Tax Exempt)</li>
    </ul>
    <ul>
        <li><span style="font-weight: bold;">Business Type</span>:</li>
        <li>For Profit Organization</li>
        <li>Manufacturer of Goods</li>
    </ul>
    <ul>
        <li><span style="font-weight: bold;">Industries Served</span>: All Other Miscellaneous Fabricated Metal Product Manufacturing, All Other Miscellaneous General Purpose Machinery Manufacturing</li>
    </ul>
    <div style="padding-top: 10px;" id="government_funding">
        <h2>Sampling of Recent  Funding Actions/Set Asides</h2>
        <p style="font-style: italic; font-size: .8em;">In order by amount of set aside monies.</p>
        <ul>
            <li><span style="color: green;">$500,000</span> - Tuesday the 29th of November 2016<br><span style="font-weight: bold; font-size: 1.2em;">Department Of Army</span> <br> W0LX ANNISTON DEPOT PROP DIV<br>IGF::CT::IGF. INCREASE FUNDING FOR THE ABRASIVE CLEAN OF VARIOUS PARTS
                <hr>
            </li>
        </ul>
    </div>
</div>
"""

bs = BeautifulSoup(html,'lxml')
for li in bs.find_all('span',text=re.compile('^(Contact Email)|^(Business Type)|^(Phone)|^(Estimated Number of Employees)|^(Estimated Annual Receipts)|^(Contact Person)|^(Industries Served)|^(Department Of Army)')):
    print(li.parent.text)
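
To see what the anchoring and alternation do on their own, here is a minimal sketch using only the re module. The pattern is a shortened version of the one above, and the sample strings are labels taken from the question's HTML; search() is used here on the assumption that it mirrors how the compiled pattern is applied to each string.

```python
import re

# Same shape as the pattern in the answer, shortened to three labels.
label_pattern = re.compile(r"^(Contact Email)|^(Phone)|^(Contact Person)")

samples = ["Phone", "Contact Phone", "Contact Email", "Business Start Date"]

# search() scans the string, but ^ only allows a match at position 0.
matches = {s: bool(label_pattern.search(s)) for s in samples}
for s, hit in matches.items():
    print(f"{s!r}: {hit}")
```

Because of the ^ anchor, "Contact Phone" is not picked up by the (Phone) alternative; without the anchor it would be.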

Upvotes: 0

FedOpp

Reputation: 313

You can try passing the text you are looking for directly as the string argument of find_all. https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Example:

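strings_to_search_for = ["Phone", "Estimated Number of Employees"]

# find_all(string=...) returns the matching text nodes themselves
businesscontainer = businessesoup.find_all(string=strings_to_search_for)
for element in businesscontainer:
    # element.parent is the <span>; its parent is the enclosing <li>
    value = element.parent.parent.text  # get <li> value
    # do something ...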

Hope it helps.
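
A self-contained sketch of the same idea, assuming three list items copied from the question's HTML and the built-in "html.parser" (the question itself uses html5lib). The variable names labels and rows are only for illustration.

```python
from bs4 import BeautifulSoup

# Three list items copied from the question's HTML.
html = """
<ul>
    <li><span style="font-weight: bold;">Phone</span>: 256-830-3440</li>
    <li><span style="font-weight: bold;">Estimated Number of Employees</span>: 64</li>
    <li><span style="font-weight: bold;">Business Start Date</span>: 1971</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
labels = ["Phone", "Estimated Number of Employees"]

rows = []
for label_text in soup.find_all(string=labels):
    # label_text is a text node; its parent is the <span>,
    # and the <span>'s parent is the enclosing <li>.
    li = label_text.parent.parent
    rows.append(li.get_text())

print(rows)
```

A string in the list only matches a text node that is exactly equal to it, so "Business Start Date" is skipped here.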

Upvotes: 1

Tempestas Ludi

Reputation: 1165

What you are doing now would not work, as you are extracting text from a <span> element, while the information you are after is in a <li> element in which a <span> is contained. I would suggest you do the following:

For each <li> element:

  • Check whether it contains a <span> element and if so, what the text of that element is.
  • If there is indeed a <span> element, with, for example, the text "Contact Email", you know that the <li> element contains the information you need.
  • If you found a <li> element with information you need, you can extract its text contents. This will probably also contain the (for example) "Contact Email" text, so you will need to do some post-processing, but that is not the hardest part of this whole quest.

EDIT: Code

Based on your code, you would probably do something like the following to extract the email address (Note: not guaranteed to work, but that is not the point)

soup = BeautifulSoup(...)
for li in soup.find_all("li"):
    span = li.find("span")
    if span is None:
        continue
    if span.get_text() == "Contact Email":
        print("Found email: " + li.get_text())
        # Now all you have to do is extract the address from the text of the <li> tag
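
The post-processing step could look roughly like the sketch below. It assumes the text of each <li> has the shape "Label: value", as in the question's HTML; split_label_value is a hypothetical helper name, and the email address is a made-up placeholder, not the one from the page.

```python
def split_label_value(li_text):
    """Split 'Label: value' text into a (label, value) pair."""
    # partition() splits on the first colon only, so colons in the value survive.
    label, _, value = li_text.partition(":")
    return label.strip(), value.strip()


# Hypothetical text, shaped like what li.get_text() gives for the email item.
label, value = split_label_value("Contact Email: jane@example.com")
print(label, "->", value)
```

Calling this on every <li> that has a <span> would give label/value pairs that can be collected into a dict, one per business.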

Upvotes: 1
