Scrapy - Use CSS to find elements that might contain differing subelements

Question

I'm using Scrapy to scrape https://www.hillhappenings.com/ for a number of data fields related to political events: name, time, date, and location. I realized the HTML for the location field uses two different formats:


    2168 Rayburn House Office Building

...and ...

            
    A St.
    Washington, DC, 20002
    United States

I'm using the following code to get the event titles and locations:

events = Selector(response=response).css('div.eventlist-column-info a.eventlist-title-link::text').getall()
addresses = Selector(response=response).css('div.eventlist-column-info li.eventlist-meta-item.eventlist-meta-address::text').getall()

The problem is that, out of 80 events, 76 use format #1 and 4 use format #2, so I get 80 events but only 76 addresses. I would like to get the multiline addresses that use format #2 as a single line, like format #1. I started using Scrapy this morning and am wondering: how can I use Scrapy to find address elements that have span tags underneath them, so I can combine them into a single-line address?

abdusco · Accepted Answer

Maybe try an attribute selector ([attr]) or the wildcard selector (*)? Since both formats keep their text in an element whose class contains eventlist-meta-address, you can use [class*="eventlist-meta-address"]::text or just .eventlist-meta-address *::text

from parsel import Selector

def extract_address(sel: Selector) -> str:
    # this one works too
    # metas = sel.css('.eventlist-meta-address *::text').getall()
    metas = sel.css('[class*="eventlist-meta-address"]::text').getall()
    # drop whitespace-only text nodes and join the rest into one line
    return ' '.join(m.strip() for m in metas if m.strip())

if __name__ == '__main__':
    format1 = '''
    
        2168 Rayburn House Office Building
    
    '''
    format2 = '''
    
        A St.
        Washington, DC, 20002
        United States
    
    '''
    for f in [format1, format2]:
        s = Selector(f)
        print(extract_address(s))

output:

2168 Rayburn House Office Building
A St. Washington, DC, 20002 United States
