Reputation: 1269
I'm using Scrapy to scrape https://www.hillhappenings.com/ for a number of data fields related to political events: name, time, date, and location. I realized the HTML for the location field uses two different formats:
<li class="eventlist-meta-item eventlist-meta-address event-meta-item">
2168 Rayburn House Office Building
</li>
...and ...
<li class="eventlist-meta-item eventlist-meta-address event-meta-item">
<span class="eventlist-meta-address-line">A St.</span>
<span class="eventlist-meta-address-line">Washington, DC, 20002</span>
<span class="eventlist-meta-address-line">United States</span>
</li>
I'm using the following code to get the event titles and locations:
events = Selector(response=response).css('div.eventlist-column-info a.eventlist-title-link::text').getall()
addresses = Selector(response=response).css('div.eventlist-column-info li.eventlist-meta-item.eventlist-meta-address::text').getall()
The problem is that out of 80 events, 76 use format #1 and 4 use format #2, so I get 80 events but only 76 addresses. I would like to get the multiline addresses in format #2 as a single line, like format #1. I'm new to Scrapy as of this morning and am wondering: how can I use Scrapy to find address elements that have span tags nested inside them, so I can combine them into a single-line address?
Upvotes: 0
Views: 186
Reputation: 11151
Maybe try an attribute selector ([attr]) or the wildcard selector (*)? Since both formats keep their text in an element whose class contains eventlist-meta-address, you can use [class*="eventlist-meta-address"]::text or just .eventlist-meta-address *::text
from parsel import Selector


def extract_address(sel: Selector) -> str:
    # this one works too
    # metas = sel.css('.eventlist-meta-address *::text').getall()
    metas = sel.css('[class*="eventlist-meta-address"]::text').getall()
    # drop whitespace-only text nodes and join the rest into one line
    return ' '.join(m.strip() for m in metas if m.strip())


if __name__ == '__main__':
    format1 = '''
    <li class="eventlist-meta-item eventlist-meta-address event-meta-item">
    2168 Rayburn House Office Building
    </li>
    '''
    format2 = '''
    <li class="eventlist-meta-item eventlist-meta-address event-meta-item">
    <span class="eventlist-meta-address-line">A St.</span>
    <span class="eventlist-meta-address-line">Washington, DC, 20002</span>
    <span class="eventlist-meta-address-line">United States</span>
    </li>
    '''
    for f in [format1, format2]:
        s = Selector(text=f)
        print(extract_address(s))
output:
2168 Rayburn House Office Building
A St. Washington, DC, 20002 United States
Upvotes: 1