SIM

Reputation: 22440

Unable to get the desired portion of an address while excluding the rest

I've written a Python script to grab an address from a webpage. When I run it, I get an address like Zimmerbachstr. 51, 74676 Niedernhall-Waldzimmern Germany. All of its parts are within the selector "[itemprop='address']". My question is: how can I get the address without the country name, which is within "[itemprop='addressCountry']"?

The full address is within this block of HTML:

<div class="push-half--bottom " itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">Zimmerbachstr. 51</span>
    <span itemprop="postalCode">74676</span>
    <span itemprop="addressLocality">Niedernhall-Waldzimmern</span><br>
    <span itemprop="addressCountry">Germany</span><br>
</div>

The approach below gets the desired portion of the address, but it is far from ideal:

from bs4 import BeautifulSoup

content = """
<div class="push-half--bottom " itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">Zimmerbachstr. 51</span>
    <span itemprop="postalCode">74676</span>
    <span itemprop="addressLocality">Niedernhall-Waldzimmern</span><br>
    <span itemprop="addressCountry">Germany</span><br>
</div>
"""
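soup = BeautifulSoup(content, "lxml")
# Remove every country span from the tree before reading the text
for country in soup.select("[itemprop='addressCountry']"):
    country.extract()
# Collect the remaining text of each address block
item = [item.get_text(strip=True) for item in soup.select("[itemprop='address']")]
print(item)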

Expected output: Zimmerbachstr. 51, 74676 Niedernhall-Waldzimmern

To be clearer: I'm looking for a one-liner that doesn't rely on a hardcoded index, because the country name may not always appear in the last position.

Upvotes: 1

Views: 38

Answers (1)

Andersson

Reputation: 52665

Solution using lxml.html:

from lxml import html

content = """
<div class="push-half--bottom " itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">Zimmerbachstr. 51</span>
    <span itemprop="postalCode">74676</span>
    <span itemprop="addressLocality">Niedernhall-Waldzimmern</span><br>
    <span itemprop="addressCountry">Germany</span><br>
</div>
"""

source = html.fromstring(content)
address = ", ".join([span.text for span in source.xpath("//div[@itemprop='address']/span[@itemprop='streetAddress' or @itemprop='postalCode' or @itemprop='addressLocality']")])

or

address = ", ".join([span.text for span in source.xpath("//div[@itemprop='address']/span[not(@itemprop='addressCountry')]")])

Output:

'Zimmerbachstr. 51, 74676, Niedernhall-Waldzimmern'
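
Since the question uses BeautifulSoup, the same "everything except the country" idea can probably be expressed with a CSS `:not()` selector as well. This is only a sketch: it assumes bs4 4.7 or newer (where `select()` is backed by soupsieve and understands `:not()`), and it uses the built-in "html.parser" instead of "lxml" to avoid the extra dependency. The names `parts` and `address` are just illustrative.

```python
from bs4 import BeautifulSoup

content = """
<div class="push-half--bottom " itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">Zimmerbachstr. 51</span>
    <span itemprop="postalCode">74676</span>
    <span itemprop="addressLocality">Niedernhall-Waldzimmern</span><br>
    <span itemprop="addressCountry">Germany</span><br>
</div>
"""

# Assumption: bs4 >= 4.7, so select() supports the :not() pseudo-class
soup = BeautifulSoup(content, "html.parser")

# Take every direct <span> child of the address block except the country,
# wherever the country span happens to sit (no index involved)
parts = [
    span.get_text(strip=True)
    for span in soup.select("[itemprop='address'] > span:not([itemprop='addressCountry'])")
]
address = ", ".join(parts)
print(address)
```

As with the XPath versions, joining with ", " puts a comma after the postal code, so the result matches the output shown above rather than the exact string in the question.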

Upvotes: 1
