Reputation: 521
I am trying to scrape the date and URL of each article from here. While I do get the list of dates and the article headlines (as text), I am failing to get the URLs for the same. This is how I am getting the headline text and the dates:
def sb_rum():
    websites = ['https://www.thespiritsbusiness.com/tag/rum/']
    for spirits in websites:
        browser.get(spirits)
        time.sleep(1)
        news_links = browser.find_elements_by_xpath('//*[@id="archivewrapper"]/div/div[2]/h3')
        n_links = [ele.text for ele in news_links]
        dates = browser.find_elements_by_xpath('//*[@id="archivewrapper"]/div/div[2]/small')
        n_dates = [ele.text for ele in dates]
        print(n_links)
        print(n_dates)
This gives me an output like
['Harpalion Spirits expands UK distribution', 'Bacardí gets fruity with new tropical rum', 'The world’s biggest-selling rums', 'Havana Club releases Tributo 2021 rum', 'Ron Santiago de Cuba rum revamps range', 'Michael B Jordan to change rum name after backlash', 'WIRD recognised for sustainable sugarcane practices', 'Rockstar Spirits advocates for UK-Australia trade deal', 'Rum Brand Champion 2021: Tanduay', 'Dictador and Niepoort partner on new rum', 'Rockstar Spirits secures £25,000 Dragons’ Den funding', 'SB meets… Lucia Alliegro, Ron Carúpano', 'Bruno Mars debuts Selvarey Coconut rum', 'Diplomático launches Mixed Consciously cocktail comp', 'Foursquare Distillery backs rum history research', 'Ron Cabezon signs distribution with Gordon & MacPhail', 'Havana Club launches smoky rum finished in whisky casks', 'Ron Colón and Bacoo Rum expand distribution', 'Harpalion Spirits launches Pedro Ximénez cask-finished rum', 'Rum’s journey to premiumisation']
['July 13th, 2021', 'July 8th, 2021', 'July 6th, 2021', 'June 30th, 2021', 'June 29th, 2021', 'June 24th, 2021', 'June 21st, 2021', 'June 21st, 2021', 'June 21st, 2021', 'June 18th, 2021', 'June 11th, 2021', 'June 7th, 2021', 'June 4th, 2021', 'June 2nd, 2021', 'May 28th, 2021', 'May 28th, 2021', 'May 26th, 2021', 'May 26th, 2021', 'May 24th, 2021', 'May 20th, 2021']
But I also want to get the URL for each article. I am able to extract the link for a single element, but I fail to extract it for all of them. To get all the links I tried something like
n_links = [ele.get_attribute('href') for ele in news_links.find_elements_by_tag_name('a')]
How can it be done? Please help.
Upvotes: 0
Views: 240
Reputation: 396
Working solution: find_elements_by_xpath returns a list, so you cannot call find_elements_by_tag_name on the list itself. Instead, look up the a tag inside each h3 element and read its href:
n_links = [ele.find_element_by_tag_name('a').get_attribute('href') for ele in news_links]
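To illustrate why the per-element lookup works without running a live browser, here is a small sketch using hypothetical stand-in classes (FakeAnchor and FakeHeadline are not real Selenium objects, just stubs mimicking the relevant part of the WebElement API):

```python
# Hypothetical stand-ins for Selenium WebElements. find_elements_by_xpath
# returns a plain list, so the <a> lookup has to happen once per element.
class FakeAnchor:
    def __init__(self, href):
        self._href = href

    def get_attribute(self, name):
        # mirrors WebElement.get_attribute for the 'href' case
        return self._href if name == 'href' else None

class FakeHeadline:
    """Stands in for one <h3> element from find_elements_by_xpath."""
    def __init__(self, href):
        self._anchor = FakeAnchor(href)

    def find_element_by_tag_name(self, tag):
        return self._anchor

news_links = [FakeHeadline('https://example.com/article-1'),
              FakeHeadline('https://example.com/article-2')]

# the working pattern: one find_element call per element in the list
n_links = [ele.find_element_by_tag_name('a').get_attribute('href')
           for ele in news_links]
print(n_links)  # ['https://example.com/article-1', 'https://example.com/article-2']
```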
Upvotes: 1
Reputation: 4779
I don't think you need selenium to scrape this webpage. I have used beautifulsoup to scrape the data you need. Here is the code:
import bs4 as bs
import requests

url = 'https://www.thespiritsbusiness.com/tag/rum/'
resp = requests.get(url)
soup = bs.BeautifulSoup(resp.text, 'lxml')

# each article sits in its own div with class "archiveEntry"
divs = soup.find_all('div', class_='archiveEntry')

urls = []
titles = []
dates = []
for i in divs:
    urls.append(i.find('a')['href'].strip())
    titles.append(i.find('h3').text.strip())
    dates.append(i.find('small').text.strip())
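The same extraction pattern can be checked offline against a minimal HTML snippet. Note the markup below is an assumption inferred from the selectors above, not the site's actual HTML:

```python
import bs4 as bs

# Hypothetical HTML mimicking the assumed div.archiveEntry structure
html = """
<div id="archivewrapper">
  <div class="archiveEntry">
    <div>
      <h3><a href="https://example.com/article-1">Article one</a></h3>
      <small>July 13th, 2021</small>
    </div>
  </div>
  <div class="archiveEntry">
    <div>
      <h3><a href="https://example.com/article-2">Article two</a></h3>
      <small>July 8th, 2021</small>
    </div>
  </div>
</div>
"""

# html.parser avoids the third-party lxml dependency for this small demo
soup = bs.BeautifulSoup(html, 'html.parser')
divs = soup.find_all('div', class_='archiveEntry')

urls = [d.find('a')['href'].strip() for d in divs]
titles = [d.find('h3').text.strip() for d in divs]
dates = [d.find('small').text.strip() for d in divs]

print(urls)    # ['https://example.com/article-1', 'https://example.com/article-2']
print(titles)  # ['Article one', 'Article two']
```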
Upvotes: 1