technophile_3

Reputation: 521

Unable to scrape multiple URLs from a website using selenium python

I am trying to scrape the date and URL of each article from here. While I do get the list of dates and the headlines of the articles (as text), I am failing to get the URLs for them. This is how I am getting the headline text and the dates.

import time
# `browser` is an existing Selenium WebDriver instance created elsewhere.

def sb_rum():
    websites = ['https://www.thespiritsbusiness.com/tag/rum/']
    for spirits in websites:
        browser.get(spirits)
        time.sleep(1)

        # the h3 elements hold the headlines, the small elements hold the dates
        news_links = browser.find_elements_by_xpath('//*[@id="archivewrapper"]/div/div[2]/h3')
        n_links = [ele.text for ele in news_links]
        dates = browser.find_elements_by_xpath('//*[@id="archivewrapper"]/div/div[2]/small')
        n_dates = [ele.text for ele in dates]
        print(n_links)
        print(n_dates)

This gives me an output like

['Harpalion Spirits expands UK distribution', 'Bacardí gets fruity with new tropical rum', 'The world’s biggest-selling rums', 'Havana Club releases Tributo 2021 rum', 'Ron Santiago de Cuba rum revamps range', 'Michael B Jordan to change rum name after backlash', 'WIRD recognised for sustainable sugarcane practices', 'Rockstar Spirits advocates for UK-Australia trade deal', 'Rum Brand Champion 2021: Tanduay', 'Dictador and Niepoort partner on new rum', 'Rockstar Spirits secures £25,000 Dragons’ Den funding', 'SB meets… Lucia Alliegro, Ron Carúpano', 'Bruno Mars debuts Selvarey Coconut rum', 'Diplomático launches Mixed Consciously cocktail comp', 'Foursquare Distillery backs rum history research', 'Ron Cabezon signs distribution with Gordon & MacPhail', 'Havana Club launches smoky rum finished in whisky casks', 'Ron Colón and Bacoo Rum expand distribution', 'Harpalion Spirits launches Pedro Ximénez cask-finished rum', 'Rum’s journey to premiumisation']
['July 13th, 2021', 'July 8th, 2021', 'July 6th, 2021', 'June 30th, 2021', 'June 29th, 2021', 'June 24th, 2021', 'June 21st, 2021', 'June 21st, 2021', 'June 21st, 2021', 'June 18th, 2021', 'June 11th, 2021', 'June 7th, 2021', 'June 4th, 2021', 'June 2nd, 2021', 'May 28th, 2021', 'May 28th, 2021', 'May 26th, 2021', 'May 26th, 2021', 'May 24th, 2021', 'May 20th, 2021']

But I also want the URL for each article. I am able to extract the link for a single article, but I fail to extract them for all of the articles. To get the links for all of them I tried something like

n_links = [ele.get_attribute('href') for ele in news_links.find_elements_by_tag_name('a')]

How can it be done? Please help.

Upvotes: 0

Views: 240

Answers (2)

RG_RG

Reputation: 396

Working solution:

n_links = [ele.find_element_by_tag_name('a').get_attribute('href') for ele in news_links]
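For Selenium 4 and later, where the find_element_by_* helpers have been removed, the same idea can be written with the By locator API. This is a minimal sketch; `browser` and `news_links` are the objects from the question:

from selenium.webdriver.common.by import By

# Grab the <a> inside each headline <h3> and read its href attribute
news_links = browser.find_elements(By.XPATH, '//*[@id="archivewrapper"]/div/div[2]/h3')
n_links = [ele.find_element(By.TAG_NAME, 'a').get_attribute('href') for ele in news_links]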

Upvotes: 1

Ram

Reputation: 4779

I don't think you need Selenium to scrape this webpage. I have used BeautifulSoup to scrape the data you need.

Here is the code:

import bs4 as bs
import requests

url = 'https://www.thespiritsbusiness.com/tag/rum/'
resp = requests.get(url)
soup = bs.BeautifulSoup(resp.text, 'lxml')

# each article on the tag page sits in a div with class "archiveEntry"
divs = soup.find_all('div', class_='archiveEntry')
urls = []
titles = []
dates = []
for entry in divs:
    urls.append(entry.find('a')['href'].strip())    # article link
    titles.append(entry.find('h3').text.strip())    # headline
    dates.append(entry.find('small').text.strip())  # publication date
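To check the result, the three lists can be combined and printed; this is just a usage sketch built on the lists above:

# each entry pairs a date with its headline and article URL
for date, title, link in zip(dates, titles, urls):
    print(date, '-', title, '-', link)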

Upvotes: 1
