Reputation: 3
I'm trying to web-scrape data from a structure that looks like this:
<div class="tables">
  <div class="table1">
    <div class="row">
      <div class="data">Useful Data</div>
      <a href="url1">
    </div>
    <div class="row">
      <div class="data">Useful Data</div>
      <a href="url1">
    </div>
  </div>
  <div class="table2">
    <div class="row">
      <div class="data">Useful Data</div>
      <a href="url3">
    </div>
    <div class="row">
      <div class="data">Useful Data</div>
      <a href="url4">
    </div>
  </div>
</div>
The data that I want is in the div "data", and also on some other pages reachable by clicking the urls. I iterate through the 'tables' using BeautifulSoup, and I'm trying to click on the links with Selenium like so:
tables = soup.find_all('div', class_='tables')
for line in tables:
    row = line.find_all('div', class_='row')
    for element in row:
        link = driver.find_element_by_xpath('//a[contains(@href, "href")]')
        # some code
In my script, this line
link = driver.find_element_by_xpath('//a[contains(@href, "href")]')
always returns the first url, when I want it to 'follow' BeautifulSoup and return the successive hrefs. So is there a way to modify the href depending on the url from the source code? I should add that all my urls are pretty similar, except for the last part (e.g. url1 = questions/ask/1000, url2 = questions/ask/1001).
I've also tried to find all the hrefs on the page so I could iterate through them, using
links = self.driver.find_element_by_xpath('//a[@href]')
but that doesn't work either. Since the page contains a lot of links that aren't useful to me, I'm not sure that's the best way to go.
Upvotes: 0
Views: 723
Reputation: 25221
Mixing the two seems a bit complicated - why not extract the href with BeautifulSoup directly?
for a in soup.select('.tables a[href]'):
    link = a['href']
You can also modify it, concatenate it with a base url, and store the results in a list to iterate over:
urls = [baseUrl+a['href'] for a in soup.select('.tables a[href]')]
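If the hrefs on the real page can be relative or already absolute, `urllib.parse.urljoin` from the standard library is a safer way to build the full URL than plain string concatenation (a sketch; `baseUrl` here is just an example root):

```python
from urllib.parse import urljoin

baseUrl = 'http://www.example.com'

# urljoin resolves leading slashes correctly and leaves absolute hrefs untouched
print(urljoin(baseUrl, '/url1'))               # http://www.example.com/url1
print(urljoin(baseUrl, 'http://other.com/x'))  # http://other.com/x
```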
Or use selenium itself, with find_elements instead of find_element, to get all the links rather than only the first one:
for a in driver.find_elements_by_xpath('//div[@class="tables"]//a[@href]'):
    print(a.get_attribute('href'))
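If you also need to keep each row's "data" text together with its link, you can scope the selection to the row instead of grabbing all anchors at once (a sketch with BeautifulSoup, using markup shaped like the question's; the class names are taken from there):

```python
from bs4 import BeautifulSoup

html = '''
<div class="tables">
  <div class="row">
    <div class="data">Useful Data</div>
    <a href="/url1"></a>
  </div>
  <div class="row">
    <div class="data">Other Data</div>
    <a href="/url2"></a>
  </div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
for row in soup.select('.tables .row'):
    # select_one stays inside this row, so data and href stay paired
    data = row.select_one('.data').get_text(strip=True)
    href = row.select_one('a[href]')['href']
    print(data, href)
```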
from bs4 import BeautifulSoup

baseUrl = 'http://www.example.com'

html = '''
<div class="tables">
  <div class="table1">
    <div class="row">
      <div class="data">Useful Data</div>
      <a href="/url1">
    </div>
    <div class="row">
      <div class="data">Useful Data</div>
      <a href="/url1">
    </div>
  </div>
  <div class="table2">
    <div class="row">
      <div class="data">Useful Data</div>
      <a href="/url3">
    </div>
    <div class="row">
      <div class="data">Useful Data</div>
      <a href="/url4">
    </div>
  </div>
</div>'''

soup = BeautifulSoup(html, 'lxml')
urls = [baseUrl + a['href'] for a in soup.select('.tables a[href]')]
for url in urls:
    print(url)  # or request the website, ...
http://www.example.com/url1
http://www.example.com/url1
http://www.example.com/url3
http://www.example.com/url4
Upvotes: 0