Unable to extract an address from some HTML elements

I've written a Python script to scrape an address out of a chunk of HTML. The address lines are separated by br tags. However, when I run my script I get [<br/>, <br/>, <br/>, <br/>] as output.

How can I get the full address?

The HTML I'm trying to collect the address from:

<div class="ACA_TabRow ACA_FLeft">
 Mailing
 <br/>
 1961 MAIN ST #186
 <br/>
 WATSONVILLE, CA, 95076
 <br/>
 United States
 <br/>
</div>

What I've tried so far:

from bs4 import BeautifulSoup
import re

html = """
<div class="ACA_TabRow ACA_FLeft">
 Mailing
 <br/>
 1961 MAIN ST #186
 <br/>
 WATSONVILLE, CA, 95076
 <br/>
 United States
 <br/>
</div>
"""
soup = BeautifulSoup(html,"lxml")
items = soup.find(class_="ACA_TabRow").find(string=re.compile("Mailing")).find_next_siblings()
print(items)

Upvotes: 0

Answers (3)

robots.txt

Reputation: 137

It seems I've found a better solution:

from bs4 import BeautifulSoup
import re

html = """
<div class="ACA_TabRow ACA_FLeft">
 Mailing
 <br/>
 1961 MAIN ST #186
 <br/>
 WATSONVILLE, CA, 95076
 <br/>
 United States
 <br/>
</div>
"""
soup = BeautifulSoup(html,"lxml")
items = soup.find(class_="ACA_TabRow").find(string=re.compile("Mailing")).find_parent()
find_text = ' '.join([item.strip() for item in items.strings])
print(find_text)

Output:

Mailing 1961 MAIN ST #186 WATSONVILLE, CA, 95076 United States
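
For context, the original attempt printed only br tags because find_next_siblings() called without arguments returns Tag objects, while the address lines are text nodes rather than tags. Below is a minimal sketch of a sibling-based variant that keeps the text nodes instead, assuming the same markup as in the question. html.parser is swapped in for lxml so the snippet runs without an extra dependency, and address_lines is only an illustrative name:

```python
from bs4 import BeautifulSoup, NavigableString
import re

html = """
<div class="ACA_TabRow ACA_FLeft">
 Mailing
 <br/>
 1961 MAIN ST #186
 <br/>
 WATSONVILLE, CA, 95076
 <br/>
 United States
 <br/>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
label = soup.find(class_="ACA_TabRow").find(string=re.compile("Mailing"))

# next_siblings yields both tags and text nodes; keep only non-empty text
address_lines = [
    sib.strip()
    for sib in label.next_siblings
    if isinstance(sib, NavigableString) and sib.strip()
]
print(address_lines)
```

This also leaves out the "Mailing" label, since iteration starts after that text node.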

Upvotes: 0

mad_

Reputation: 8273

I would check whether the first stripped string inside the div starts with Mailing, and print the remaining strings only if it does:

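from bs4 import BeautifulSoup

# html is the same string defined in the question
soup = BeautifulSoup(html, "lxml")
items = soup.find(class_="ACA_TabRow")

for i, item in enumerate(items.stripped_strings):
    # Stop immediately if the block is not a Mailing address
    if i == 0 and not item.startswith('Mailing'):
        break
    # Skip the "Mailing" label itself and print the address lines
    if i != 0:
        print(item)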

Output:

1961 MAIN ST #186
WATSONVILLE, CA, 95076
United States
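
If the same check is needed for several blocks, it could be wrapped in a small helper. This is only a sketch: address_after_label and its parameters are hypothetical names, and html.parser is used instead of lxml so it runs with Beautiful Soup alone:

```python
from bs4 import BeautifulSoup


def address_after_label(html, label="Mailing", css_class="ACA_TabRow"):
    """Return the lines following `label` in the first matching block, or None."""
    soup = BeautifulSoup(html, "html.parser")
    block = soup.find(class_=css_class)
    if block is None:
        return None
    strings = list(block.stripped_strings)
    # Only treat the block as an address if it begins with the expected label
    if not strings or not strings[0].startswith(label):
        return None
    return strings[1:]


sample = (
    '<div class="ACA_TabRow ACA_FLeft"> Mailing <br/> 1961 MAIN ST #186 <br/>'
    ' WATSONVILLE, CA, 95076 <br/> United States <br/></div>'
)
result = address_after_label(sample)
print(result)
```

Returning None for a block that does not start with the label mirrors the early break in the loop above.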

Upvotes: 2

chitown88

Reputation: 28630

from bs4 import BeautifulSoup
import re

html = """
<div class="ACA_TabRow ACA_FLeft">
 Mailing
 <br/>
 1961 MAIN ST #186
 <br/>
 WATSONVILLE, CA, 95076
 <br/>
 United States
 <br/>
</div>
"""
soup = BeautifulSoup(html,"lxml")
items = soup.find(class_="ACA_TabRow")

items_list = items.text.split('\n')

results = [ x.strip() for x in items_list if x.strip() != '' ]

print(results)

Output:

['Mailing', '1961 MAIN ST #186', 'WATSONVILLE, CA, 95076', 'United States']
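
A possibly shorter variant of the same idea is to let get_text() do the stripping and joining. The sketch below assumes the markup from the question and uses html.parser rather than lxml so it runs without the extra dependency; lines is just an illustrative name:

```python
from bs4 import BeautifulSoup

html = """
<div class="ACA_TabRow ACA_FLeft">
 Mailing
 <br/>
 1961 MAIN ST #186
 <br/>
 WATSONVILLE, CA, 95076
 <br/>
 United States
 <br/>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
block = soup.find(class_="ACA_TabRow")

# strip=True trims each text node and drops the empty ones before joining
lines = block.get_text(separator="\n", strip=True).split("\n")
print(lines)
```

Slicing with lines[1:] would drop the "Mailing" label if only the address is wanted.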

Upvotes: 0
