Reputation: 137
I've written a Python script to scrape an address out of a chunk of HTML. The address lines are separated by a few br
tags. However, when I run my script I get only [<br/>, <br/>, <br/>, <br/>]
as output.
How can I get the full address?
The HTML I'm trying to collect the address from:
<div class="ACA_TabRow ACA_FLeft">
Mailing
<br/>
1961 MAIN ST #186
<br/>
WATSONVILLE, CA, 95076
<br/>
United States
<br/>
</div>
What I've tried so far:
from bs4 import BeautifulSoup
import re
html = """
<div class="ACA_TabRow ACA_FLeft">
Mailing
<br/>
1961 MAIN ST #186
<br/>
WATSONVILLE, CA, 95076
<br/>
United States
<br/>
</div>
"""
soup = BeautifulSoup(html,"lxml")
items = soup.find(class_="ACA_TabRow").find(string=re.compile("Mailing")).find_next_siblings()
print(items)
Upvotes: 0
Reputation: 137
It seems I've found a better solution:
from bs4 import BeautifulSoup
import re
html = """
<div class="ACA_TabRow ACA_FLeft">
Mailing
<br/>
1961 MAIN ST #186
<br/>
WATSONVILLE, CA, 95076
<br/>
United States
<br/>
</div>
"""
soup = BeautifulSoup(html,"lxml")
items = soup.find(class_="ACA_TabRow").find(string=re.compile("Mailing")).find_parent()
find_text = ' '.join([item.strip() for item in items.strings])
print(find_text)
Output:
Mailing 1961 MAIN ST #186 WATSONVILLE, CA, 95076 United States
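For what it's worth, the original attempt returned only br tags because find_next_siblings() called with no arguments yields Tag objects and skips the text nodes in between. A minimal sketch of an alternative, assuming the label is always the first text node in the div: iterate over .next_siblings instead and keep only the strings. The name address_lines is just illustrative, and I'm using "html.parser" here so the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup, NavigableString
import re

html = """
<div class="ACA_TabRow ACA_FLeft">
Mailing
<br/>
1961 MAIN ST #186
<br/>
WATSONVILLE, CA, 95076
<br/>
United States
<br/>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the text node that holds the "Mailing" label.
label = soup.find(class_="ACA_TabRow").find(string=re.compile("Mailing"))

# .next_siblings yields both tags and text nodes; keep the non-empty text only.
address_lines = [
    node.strip()
    for node in label.next_siblings
    if isinstance(node, NavigableString) and node.strip()
]
print(address_lines)
```

This should leave the "Mailing" label out of the result, since it only looks at what comes after it.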
Upvotes: 0
Reputation: 8273
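I would check whether the first stripped string inside the div starts with Mailing, and if it does, print the strings that follow it:
from bs4 import BeautifulSoup

# html is the same string as in the question
soup = BeautifulSoup(html, "lxml")
items = soup.find(class_="ACA_TabRow")
for i, item in enumerate(items.stripped_strings):
    # Stop right away if the first string is not the "Mailing" label
    if i == 0 and not item.startswith('Mailing'):
        break
    # Skip the label itself and print each address line
    if i != 0:
        print(item)
Output: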
1961 MAIN ST #186
WATSONVILLE, CA, 95076
United States
Upvotes: 2
Reputation: 28630
from bs4 import BeautifulSoup
import re
html = """
<div class="ACA_TabRow ACA_FLeft">
Mailing
<br/>
1961 MAIN ST #186
<br/>
WATSONVILLE, CA, 95076
<br/>
United States
<br/>
</div>
"""
soup = BeautifulSoup(html,"lxml")
items = soup.find(class_="ACA_TabRow")
items_list = items.text.split('\n')
results = [ x.strip() for x in items_list if x.strip() != '' ]
print(results)
Output:
['Mailing', '1961 MAIN ST #186', 'WATSONVILLE, CA, 95076', 'United States']
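If the 'Mailing' label should not be part of the result, one possible variation is to wrap the lookup in a small helper that drops the first string. This is only a sketch: extract_address and its label parameter are hypothetical names, it assumes the label is always the first non-empty string in the div, and it uses "html.parser" rather than lxml to avoid the extra dependency.

```python
from bs4 import BeautifulSoup


def extract_address(html, label="Mailing"):
    """Return the address lines that follow the label, or [] if not found."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find(class_="ACA_TabRow")
    if div is None:
        return []
    # stripped_strings yields each text node with surrounding whitespace removed.
    parts = list(div.stripped_strings)
    if parts and parts[0].startswith(label):
        return parts[1:]
    return []


sample = """
<div class="ACA_TabRow ACA_FLeft">
Mailing
<br/>
1961 MAIN ST #186
<br/>
WATSONVILLE, CA, 95076
<br/>
United States
<br/>
</div>
"""

address = extract_address(sample)
print(address)
```

Returning a list keeps the lines separate, so the caller can decide how to join them.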
Upvotes: 0