MITHU

Reputation: 164

Unable to scoop out desired portion of address out of long ones

I'm trying to scrape addresses out of some HTML elements using the BeautifulSoup library. My intention is to grab each address up to, but not including, the last County. The problem is that County appears twice in every address (once in the street name and once in the county name), so my script stops too early.

The HTML source of the three addresses:

<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
650 County Road 375<br><a title="YOUnity Clothing located in Jarrell, TX " href="/tx/williamson/jarrell">Jarrell</a>, <a title="YOUnity Clothing located in TX " href="/tx">TX</a> 76537<br><a title="YOUnity Clothing located in Williamson County, TX " href="/tx/williamson">Williamson County</a><br><br><div class="bizbtn"><a title="YOUnity Clothing" href="http://www.younityclothing.com" rel="nofollow" target="_blank">YOUnity Clothing Website</a></div>
</div>


<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
39829 County Road 452<br><a title="Eco Sciences, LLC located in Leesburg, FL " href="/fl/lake/leesburg">Leesburg</a>, <a title="Eco Sciences, LLC located in FL " href="/fl">FL</a> 32788<br><a title="Eco Sciences, LLC located in Lake County, FL " href="/fl/lake">Lake County</a><br><br><div class="bizbtn"><a title="Eco Sciences, LLC" href="http://www.ecosciencesllc.com/" rel="nofollow" target="_blank">Eco Sciences, LLC Website</a></div>
</div>


<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
28 County Road 884<br><a title="R&amp;P Painting located in Rainsville, AL " href="/al/dekalb/rainsville">Rainsville</a>, <a title="R&amp;P Painting located in AL " href="/al">AL</a> 35986<br><a title="R&amp;P Painting located in DeKalb County, AL " href="/al/dekalb">DeKalb County</a>
</div>

These are the stripped strings that each block yields:

['', 'Business Address:', '39829 County Road 452', 'Leesburg', ',', 'FL', '32788', 'Lake County', 'Eco Sciences, LLC Website', '']

['', 'Business Address:', '28 County Road 884', 'Rainsville', ',', 'AL', '35986', 'DeKalb County', '']

['', 'Business Address:', '650 County Road 375', 'Jarrell', ',', 'TX', '76537', 'Williamson County', 'YOUnity Clothing Website', '']

Expected output:

Business Address: 39829 County Road 452 Leesburg , FL 32788
Business Address: 28 County Road 884 Rainsville , AL 35986
Business Address: 650 County Road 375 Jarrell , TX 76537

What I've tried so far:

from bs4 import BeautifulSoup

html = """
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
39829 County Road 452<br><a title="Eco Sciences, LLC located in Leesburg, FL " href="/fl/lake/leesburg">Leesburg</a>, <a title="Eco Sciences, LLC located in FL " href="/fl">FL</a> 32788<br><a title="Eco Sciences, LLC located in Lake County, FL " href="/fl/lake">Lake County</a><br><br><div class="bizbtn"><a title="Eco Sciences, LLC" href="http://www.ecosciencesllc.com/" rel="nofollow" target="_blank">Eco Sciences, LLC Website</a></div>
</div>
"""
soup = BeautifulSoup(html,"lxml")
address = []
for i in soup.select_one(".bizgrid_hdr_address"):
    if not i.string:continue
    if 'County' in i.string.strip():break
    address.append(i.string.strip())
print(' '.join(address).strip())

Unfortunately, the above attempt produces only Business Address: because it encounters the first County and breaks out of the loop, whereas my goal is to grab everything up to the last County.

How can I capture the desired portion of address?

Upvotes: 0

Views: 67

Answers (3)

baduker

Reputation: 20052

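I'm not sure whether this will hold for a larger part of the HTML, but every link to a business site has the word Website in its text, so you can filter those strings out. What remains is seven strings per address, which lets you split the flat list into chunks of seven.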

For example:

from bs4 import BeautifulSoup

html = """<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
650 County Road 375<br><a title="YOUnity Clothing located in Jarrell, TX " href="/tx/williamson/jarrell">Jarrell</a>, <a title="YOUnity Clothing located in TX " href="/tx">TX</a> 76537<br><a title="YOUnity Clothing located in Williamson County, TX " href="/tx/williamson">Williamson County</a><br><br><div class="bizbtn"><a title="YOUnity Clothing" href="http://www.younityclothing.com" rel="nofollow" target="_blank">YOUnity Clothing Website</a></div>
</div>


<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
39829 County Road 452<br><a title="Eco Sciences, LLC located in Leesburg, FL " href="/fl/lake/leesburg">Leesburg</a>, <a title="Eco Sciences, LLC located in FL " href="/fl">FL</a> 32788<br><a title="Eco Sciences, LLC located in Lake County, FL " href="/fl/lake">Lake County</a><br><br><div class="bizbtn"><a title="Eco Sciences, LLC" href="http://www.ecosciencesllc.com/" rel="nofollow" target="_blank">Eco Sciences, LLC Website</a></div>
</div>


<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
28 County Road 884<br><a title="R&amp;P Painting located in Rainsville, AL " href="/al/dekalb/rainsville">Rainsville</a>, <a title="R&amp;P Painting located in AL " href="/al">AL</a> 35986<br><a title="R&amp;P Painting located in DeKalb County, AL " href="/al/dekalb">DeKalb County</a>
</div>
"""
output = []
for div in BeautifulSoup(html, "lxml").select(".bizgrid_hdr_address"):
    for item in div:
        if item.string and item.string.strip():
            text = item.string.strip()
            if "Website" in text:
                continue
            output.append(text)

addresses = [output[i:i+7] for i in range(0, len(output), 7)]
for address in addresses:
    print(" ".join(address).replace(" ,", ","))

This gets you:

Business Address: 650 County Road 375 Jarrell, TX 76537 Williamson County
Business Address: 39829 County Road 452 Leesburg, FL 32788 Lake County
Business Address: 28 County Road 884 Rainsville, AL 35986 DeKalb County
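The output above still ends with the county name, while the question asks for the address to stop before it. One possible way to drop it, sketched here under the assumption that the county name always sits inside an `<a>` tag while the street is a plain text node (as in the three sample blocks), is to stop at the first link whose text contains County. The helper name `address_until_county_link` is made up for illustration, the HTML is an abridged copy of two blocks from the question, and `html.parser` is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Abridged copy of two blocks from the question (title attributes dropped).
html = """
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
650 County Road 375<br><a href="/tx/williamson/jarrell">Jarrell</a>, <a href="/tx">TX</a> 76537<br><a href="/tx/williamson">Williamson County</a><br><br><div class="bizbtn"><a href="http://www.younityclothing.com">YOUnity Clothing Website</a></div>
</div>
<div class="col_biz bizgrid_hdr_address">
<strong>Business Address:</strong><br>
28 County Road 884<br><a href="/al/dekalb/rainsville">Rainsville</a>, <a href="/al">AL</a> 35986<br><a href="/al/dekalb">DeKalb County</a>
</div>
"""


def address_until_county_link(div):
    """Collect the block's text up to the first <a> whose text contains 'County'."""
    parts = []
    for node in div.children:
        # A street such as "650 County Road 375" is a text node, not a Tag,
        # so only the county link triggers the break.
        if isinstance(node, Tag) and node.name == "a" and "County" in node.get_text():
            break
        text = (node.string or "").strip()  # <br> has no string, so it becomes ""
        if text:
            parts.append(text)
    return " ".join(parts).replace(" ,", ",")


soup = BeautifulSoup(html, "html.parser")
results = [address_until_county_link(div) for div in soup.select(".bizgrid_hdr_address")]
for line in results:
    print(line)
```

A city link whose own text contains County would end the address early, so this is only as reliable as the assumption about the markup.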

Upvotes: 1

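# Assumes `soup` was built from the question's HTML, e.g. BeautifulSoup(html, "lxml").
# Positions of the street, city, state and ZIP code in each block's list of strings.
indices = (2, 3, 5, 6)
for div in soup.select(".col_biz"):
    parts = [s.strip() for s in div.strings]
    print(*(parts[i] for i in indices))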

Output:

650 County Road 375 Jarrell TX 76537
39829 County Road 452 Leesburg FL 32788
28 County Road 884 Rainsville AL 35986

Or, reading the city and state from the title attribute of the first link:

goal = [(x.contents[3].strip(), x.contents[5]['title'].split("in ")[-1].strip())
        for x in soup.select(".col_biz")]

Output:

[('650 County Road 375', 'Jarrell, TX'), ('39829 County Road 452', 'Leesburg, FL'), ('28 County Road 884', 'Rainsville, AL')]
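Splitting the title on `"in "` works for the three samples, but it would likely cut a place name that itself contains those characters. As a minimal sketch, assuming every title follows the "<business> located in <place>" pattern seen in the question, `str.rpartition` on the full marker is a little more forgiving. The helper name `place_from_title` and the Twin Falls title are hypothetical and used only for illustration.

```python
def place_from_title(title):
    """Return the text after the last ' located in ' marker, or '' if it is absent."""
    _, marker, place = title.rpartition(" located in ")
    return place.strip() if marker else ""


# Title taken from the question's HTML.
print(place_from_title("YOUnity Clothing located in Jarrell, TX "))

# Hypothetical title: split("in ")[-1] would return only "Falls, ID" here.
print(place_from_title("Acme located in Twin Falls, ID "))
```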

Upvotes: 1

kirabin

Reputation: 40

I haven't tested this code, but the idea is to use a flag: the first time County appears, set the flag to 1; the second time, break out of the loop.

...
soup = BeautifulSoup(html,"lxml")
address = []

flag = 0
for i in soup.select_one(".bizgrid_hdr_address"):
    if not i.string:continue
    if 'County' in i.string.strip() and flag:
        break
    if 'County' in i.string.strip(): 
        flag = 1
    address.append(i.string.strip())
print(' '.join(address).strip())
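The flag relies on County showing up exactly twice. If a street name does not contain the word, the loop would run past the county name and pick up the website text as well. A variation that does not depend on the count, sketched here on the string lists shown in the question, is to locate the last item containing the keyword and keep everything before it. The function name `trim_before_last` is made up for illustration.

```python
def trim_before_last(parts, keyword="County"):
    """Join the non-empty items that come before the last item containing `keyword`."""
    last = max((i for i, part in enumerate(parts) if keyword in part), default=None)
    kept = parts if last is None else parts[:last]
    return " ".join(part for part in kept if part)


# String lists copied from the question.
rows = [
    ['', 'Business Address:', '39829 County Road 452', 'Leesburg', ',', 'FL', '32788', 'Lake County', 'Eco Sciences, LLC Website', ''],
    ['', 'Business Address:', '28 County Road 884', 'Rainsville', ',', 'AL', '35986', 'DeKalb County', ''],
]
trimmed = [trim_before_last(row) for row in rows]
for line in trimmed:
    print(line)
```

When no item contains the keyword, the sketch returns the whole list joined, so a real scraper would still need to decide what to do with the website text in that case.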

Upvotes: 1
