jkovba

Reputation: 1269

Scrapy - Use CSS to find elements that might contain differing subelements

I'm using Scrapy to scrape https://www.hillhappenings.com/ for several data fields related to political events: name, time, date, and location. I realized the HTML for the location field comes in two different formats:

<li class="eventlist-meta-item eventlist-meta-address event-meta-item">
    2168 Rayburn House Office Building
</li>

...and ...

<li class="eventlist-meta-item eventlist-meta-address event-meta-item">            
    <span class="eventlist-meta-address-line">A St.</span>
    <span class="eventlist-meta-address-line">Washington, DC, 20002</span>
    <span class="eventlist-meta-address-line">United States</span>
</li>

I'm using the following code to get the event titles and locations:

events = Selector(response=response).css('div.eventlist-column-info a.eventlist-title-link::text').getall()
addresses = Selector(response=response).css('div.eventlist-column-info li.eventlist-meta-item.eventlist-meta-address::text').getall()

The problem is that, out of 80 events, 76 use format #1 and 4 use format #2, so I get 80 events but only 76 addresses. I would like to get the multi-line addresses in format #2 as a single line, like format #1. I'm new to Scrapy as of this morning and am wondering: how can I use Scrapy to find address elements that have span tags nested inside them, so I can combine them into a single-line address?

Upvotes: 0

Views: 186

Answers (1)

abdusco

Reputation: 11151

Maybe try an attribute selector ([attr]) or the wildcard selector (*)? In both formats the text sits in an element whose class contains eventlist-meta-address (the li itself, or the eventlist-meta-address-line spans inside it), so you can use [class*="eventlist-meta-address"]::text or just .eventlist-meta-address *::text

from parsel import Selector

def extract_address(sel: Selector) -> str:
    # this one works too
    # metas = sel.css('.eventlist-meta-address *::text').getall()
    metas = sel.css('[class*="eventlist-meta-address"]::text').getall()
    # drop whitespace-only text nodes and join the rest with single spaces
    return ' '.join(m.strip() for m in metas if m.strip())

if __name__ == '__main__':
    format1 = '''
    <li class="eventlist-meta-item eventlist-meta-address event-meta-item">
        2168 Rayburn House Office Building
    </li>
    '''
    format2 = '''
    <li class="eventlist-meta-item eventlist-meta-address event-meta-item">
        <span class="eventlist-meta-address-line">A St.</span>
        <span class="eventlist-meta-address-line">Washington, DC, 20002</span>
        <span class="eventlist-meta-address-line">United States</span>
    </li>
    '''
    for f in [format1, format2]:
        s = Selector(f)
        print(extract_address(s))

Output:

2168 Rayburn House Office Building
A St. Washington, DC, 20002 United States

Upvotes: 1
