Reputation: 163
I am trying to extract the text from the dd elements that sit between the dt tags (which are used to mark different dates). I tried a really hacky method, but it didn't work consistently enough:
timeDiv = mezzrowSource.find_all("dd", class_="orange event-date")
eventDiv = mezzrowSource.find_all("dd", class_="event")
index = 0
for time in timeDiv:
    returnValue[timeDiv[index].text] = eventDiv[index].text.strip()
    if "8" in timeDiv[index+3].text or "4:30" in timeDiv[index+3].text:
        break
    index += 1
Enumerating in that way worked most of the time but would sometimes extract events from other dates. The source of the section in question is pasted below. Any ideas?
<dt class="purple">Sun, September 30th, 2018</dt>
<dd class="orange event-date">4:30 PM to 7:00 PM</dd>
<dd class="event"><a href="/events/4094-mezzrow-classical-salon-with-david-oei"
class="event-title">Mezzrow Classical Salon with David Oei</a>
</dd>
<dd class="orange event-date">8:00 PM to 10:30 PM</dd>
<dd class="event"><a href="/events/4144-luke-sellick-ron-blake-adam-birnbaum"
class="event-title">Luke Sellick, Ron Blake & Adam Birnbaum</a>
</dd>
<dd class="orange event-date">11:00 PM to 1:00 AM</dd>
<dd class="event"><a href="/events/4099-ryo-sasaki-friends-after-hours"
class="event-title">Ryo Sasaki & Friends "After-hours"</a>
</dd>
<dt class="purple">Mon, October 1st, 2018</dt>
<dd class="orange event-date">8:00 PM to 10:30 PM</dd>
<dd class="event"><a href="/events/4137-greg-ruggiero-murray-wall-steve-little"
class="event-title">Greg Ruggiero, Murray Wall & Steve Little</a>
</dd>
<dd class="orange event-date">11:00 PM to 1:00 AM</dd>
<dd class="event"><a href="/events/4174-pasquale-grasso-after-hours"
class="event-title">Pasquale Grasso "After-hours"</a>
</dd>
Expected output is a dictionary that looks like this: {'4:30 PM to 7:00 PM': 'Mezzrow Classical Salon with David Oei', '8:00 PM to 10:30 PM': 'Greg Ruggiero, Murray Wall & Steve Little', '11:00 PM to 1:00 AM': 'Pasquale Grasso "After-hours"'}
Upvotes: 0
Views: 370
Reputation: 9430
If I understand the question correctly you can use zip():
from bs4 import BeautifulSoup

# html is assumed to hold the markup pasted in the question
mezzrowSource = BeautifulSoup(html, 'lxml')
timeDiv = [tag.get_text() for tag in mezzrowSource.find_all("dd", class_="orange event-date")]
eventDiv = [tag.get_text().strip() for tag in mezzrowSource.find_all("dd", class_="event")]
print(dict(zip(timeDiv, eventDiv)))
Outputs:
{'4:30 PM to 7:00 PM': 'Mezzrow Classical Salon with David Oei', '8:00 PM to 10:30 PM': 'Greg Ruggiero, Murray Wall & Steve Little', '11:00 PM to 1:00 AM': 'Pasquale Grasso "After-hours"'}
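One caveat with this approach: dictionary keys are unique, so when the same time slot appears under more than one date, the later event silently replaces the earlier one. The sketch below uses hard-coded lists (values copied from the sample HTML in the question, variable names are just for illustration) to show five time/event pairs collapsing into three keys:

```python
# Times and titles in document order, copied from the sample HTML.
times = [
    "4:30 PM to 7:00 PM", "8:00 PM to 10:30 PM", "11:00 PM to 1:00 AM",  # Sunday
    "8:00 PM to 10:30 PM", "11:00 PM to 1:00 AM",                        # Monday
]
events = [
    "Mezzrow Classical Salon with David Oei",
    "Luke Sellick, Ron Blake & Adam Birnbaum",
    'Ryo Sasaki & Friends "After-hours"',
    "Greg Ruggiero, Murray Wall & Steve Little",
    'Pasquale Grasso "After-hours"',
]

# zip() pairs the lists positionally; dict() keeps only the last value per key.
merged = dict(zip(times, events))

print(len(times), "pairs ->", len(merged), "keys")
print(merged["8:00 PM to 10:30 PM"])  # the Monday event has replaced the Sunday one
```

So zip() is only safe here if every time slot is unique, or if you have already narrowed the elements down to a single date.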
Updated:
The elements you want data from are all siblings, i.e. there is no container element around each date's data, which makes it harder to group the data the way you want. The only thing in your favor is the order: the element with the date comes first, then a time, then a title, and the time/title pair can repeat. So this method selects all the elements we want and iterates over them. When it sees a date, it stores it in a string; for each time and title that follow, it adds a tuple to a list. When it reaches the next date, it saves the previous date and its list of tuples in a dictionary and starts a new list. After the loop, it saves the final date and its list of tuples. It is a bit messy, but that is due to the lack of structure in the HTML.
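from bs4 import BeautifulSoup
import requests
import re
import pprint

url = 'https://www.mezzrow.com/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
# Select every element whose class matches a date, time or event marker
ds = soup.find_all(True, {'class': re.compile('purple|event|orange event-date')})
ret = {}   # date -> list of (time, title) tuples
tmp = []   # tuples collected for the current date
i = None   # current date string
for d in ds:
    if d.attrs['class'] == ['purple']:
        # New date: save the previous date's events before starting over
        if i is not None:
            ret[i] = tmp
            tmp = []
        i = d.get_text()
    elif d.attrs['class'] == ['orange', 'event-date']:
        j = d.get_text()  # remember the time for the event that follows
    elif d.attrs['class'] == ['event']:
        tmp.append((j, d.get_text(strip=True)))
# Save the events of the final date
ret[i] = tmp
pp = pprint.PrettyPrinter(depth=6)
pp.pprint(ret)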
Outputs:
{'Fri, October 12th, 2018': [('8:00 PM to 10:30 PM',
                              'Rossano Sportiello, Pasquale Grasso & Frank '
                              'Tate'),
                             ('11:00 PM to 2:00 AM',
                              'Ben Paterson "After-hours"')],
 'Fri, October 5th, 2018': [('8:00 PM to 10:30 PM',
                             'Vanessa Rubin, Brandon McCune, Kenny Davis & '
                             'Winard Harper'),
                            ('11:00 PM to 2:00 AM',
                             'Joe Davidian "After-hours"')],
 'Mon, October 1st, 2018': [('8:00 PM to 10:30 PM',
                             'Greg Ruggiero, Murray Wall & Steve Little'),
                            ('11:00 PM to 1:00 AM',
                             'Pasquale Grasso "After-hours"')],
 ....
Then select the date you want from the dict object.
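To make that last step concrete, here is a minimal offline sketch that runs against a simplified copy of the sample HTML from the question (links trimmed, ampersands escaped) instead of the live site. It is a variation on the same idea, not a drop-in replacement: it walks the siblings after each dt with find_next_siblings() and stops at the next dt. The names events_by_date, schedule and monday are made up for this example, and I use the built-in html.parser so lxml is not required.

```python
from bs4 import BeautifulSoup

# Simplified copy of the markup from the question (assumption: same structure as the live page).
html = """
<dt class="purple">Sun, September 30th, 2018</dt>
<dd class="orange event-date">4:30 PM to 7:00 PM</dd>
<dd class="event"><a class="event-title">Mezzrow Classical Salon with David Oei</a>
</dd>
<dd class="orange event-date">8:00 PM to 10:30 PM</dd>
<dd class="event"><a class="event-title">Luke Sellick, Ron Blake &amp; Adam Birnbaum</a>
</dd>
<dd class="orange event-date">11:00 PM to 1:00 AM</dd>
<dd class="event"><a class="event-title">Ryo Sasaki &amp; Friends "After-hours"</a>
</dd>
<dt class="purple">Mon, October 1st, 2018</dt>
<dd class="orange event-date">8:00 PM to 10:30 PM</dd>
<dd class="event"><a class="event-title">Greg Ruggiero, Murray Wall &amp; Steve Little</a>
</dd>
<dd class="orange event-date">11:00 PM to 1:00 AM</dd>
<dd class="event"><a class="event-title">Pasquale Grasso "After-hours"</a>
</dd>
"""


def events_by_date(markup):
    """Return {date: [(time, title), ...]} for flat dt/dd sibling markup."""
    soup = BeautifulSoup(markup, "html.parser")
    result = {}
    for dt in soup.find_all("dt", class_="purple"):
        slots = []
        current_time = None
        # Walk the following dt/dd siblings until the next date header.
        for sib in dt.find_next_siblings(["dt", "dd"]):
            if sib.name == "dt":
                break
            classes = sib.get("class", [])
            if "event-date" in classes:
                current_time = sib.get_text(strip=True)
            elif "event" in classes:
                slots.append((current_time, sib.get_text(strip=True)))
        result[dt.get_text(strip=True)] = slots
    return result


schedule = events_by_date(html)
# Pick one date and turn its (time, title) tuples into the dict asked for.
monday = dict(schedule["Mon, October 1st, 2018"])
print(monday)
```

Because each date gets its own list, a time slot that repeats on another day no longer overwrites anything. Converting a single date's list with dict() is safe as long as that date has no duplicate time slots.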
Upvotes: 1