Mr.Rlover

Reputation: 2623

Get HTML href link that matches string from a list of strings with Beautiful Soup

I am trying to get urls from a webpage that has a list of urls. I do not want to get all the urls, only the ones whose text matches the text of the strings in a list. The list of strings is a subset of the text of the links on the webpage, which I extracted by scraping the page and removing the text that I do not want. I have a list of strings stored in filenames.

I am trying to extract only the links whose text is one of the strings in the list. The code below returns an empty list:

    import requests
    from bs4 import BeautifulSoup

    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html5lib')

    links = soup.findAll('a', string = filenames[0])
    file_links = [link['href'] for link in links if "export" in link['href']]

The tag looks something like this:

<p><a href="https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi">
                            ECZ Mathematics Paper 2 2019.</a></p>

<p><a href="https://drive.google.com/uc?export=download&id=1x_9E3PaviCuSsqfJqOsQKOwVlCWZ1jqf">
                            ECZ Mathematics Paper 1 2019.</a></p>

<p><a href="https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp">
                            ECZ Science Paper 3 2009.</a></p>

<p><a href="https://drive.google.com/uc?export=download&id=0B0lFc6TrfIg7aENYc1V6akRVVnc">
                            ECZ Civic Education Paper 2 2009.</a></p>

I want to get the href links of the first three but not the last one, since the string 'ECZ Civic Education Paper 2 2009.' is not part of my list of strings.

My list of strings looks like this:


filenames = ['ECZ Mathematics Paper 1 2019.', 'ECZ Mathematics Paper 2 2019.',
             'ECZ Science Paper 3 2009.']

I only want the first three links, because their link text is in my list (filenames). I don't want the fourth link, because its text (ECZ Civic Education Paper 2 2009.) isn't in my list and I don't want to download that file.

Upvotes: 2

Views: 470

Answers (3)

Andrej Kesely

Reputation: 195613

You can construct a CSS selector and then select the links in one go. For example (html is the code snippet from the question):

filenames = ['ECZ Mathematics Paper 1 2019.',
             'ECZ Mathematics Paper 2 2019.',
             'ECZ Science Paper 3 2009.']

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

for a in soup.select(','.join('a:contains("{}")'.format(i) for i in filenames)):
    print(a['href'])

Prints:

https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi
https://drive.google.com/uc?export=download&id=1x_9E3PaviCuSsqfJqOsQKOwVlCWZ1jqf
https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp
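
In newer versions of Soup Sieve (the selector engine behind `select`), `:contains` is deprecated in favor of `:-soup-contains`. A minimal sketch of the same idea with the newer name, assuming Soup Sieve 2.1 or later is installed. The `html` string is a shortened copy of the snippet from the question, and `selector` and `hrefs` are illustrative names only:

```python
from bs4 import BeautifulSoup

html = """
<p><a href="https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi">
    ECZ Mathematics Paper 2 2019.</a></p>
<p><a href="https://drive.google.com/uc?export=download&id=0B0lFc6TrfIg7aENYc1V6akRVVnc">
    ECZ Civic Education Paper 2 2009.</a></p>
"""

filenames = ['ECZ Mathematics Paper 1 2019.',
             'ECZ Mathematics Paper 2 2019.',
             'ECZ Science Paper 3 2009.']

soup = BeautifulSoup(html, 'html.parser')

# Build one comma-separated selector; each part matches <a> tags
# whose text contains the given name as a substring.
selector = ','.join('a:-soup-contains("{}")'.format(name) for name in filenames)
hrefs = [a['href'] for a in soup.select(selector)]
print(hrefs)
```

Because this is a substring test, a link whose text merely contains one of the names would also match, so it may over-select on pages where one title is a prefix of another.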

Upvotes: 1

Jack Fleeting

Reputation: 24940

Try it this way and see if it works:

    from bs4 import BeautifulSoup as bs

    html = """
    <p><a href="https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi">
                                ECZ Mathematics Paper 2 2019.</a></p>
    <p><a href="https://drive.google.com/uc?export=download&id=1x_9E3PaviCuSsqfJqOsQKOwVlCWZ1jqf">
                                ECZ Mathematics Paper 1 2019.</a></p>
    <p><a href="https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp">
                                ECZ Science Paper 3 2009.</a></p>
    <p><a href="https://drive.google.com/uc?export=download&id=0B0lFc6TrfIg7aENYc1V6akRVVnc">
                                ECZ Civic Education Paper 2 2009.</a></p>
    """

    filenames = ['ECZ Mathematics Paper 1 2019.', 'ECZ Mathematics Paper 2 2019.',
                 'ECZ Science Paper 3 2009.']

    soup = bs(html, 'html5lib')

    all_links = soup.findAll('a')

    # Compare the whitespace-stripped link text against each wanted name.
    for link in all_links:
        for nam in filenames:
            if link.text.strip() == nam:
                print(link['href'])

Output:

https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi
https://drive.google.com/uc?export=download&id=1x_9E3PaviCuSsqfJqOsQKOwVlCWZ1jqf
https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp
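
The nested loop prints a link once for every matching entry in `filenames`, so a name listed twice would print the same URL twice. A possible variation, sketched here on the assumption that exact matches on whitespace-stripped link text are what is wanted, turns the list into a set and tests membership once per link. The names `wanted` and `file_links` are illustrative, the HTML is a shortened copy of the snippet, and `html.parser` is used only to avoid the extra `html5lib` dependency:

```python
from bs4 import BeautifulSoup

html = """
<p><a href="https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi">
        ECZ Mathematics Paper 2 2019.</a></p>
<p><a href="https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp">
        ECZ Science Paper 3 2009.</a></p>
<p><a href="https://drive.google.com/uc?export=download&id=0B0lFc6TrfIg7aENYc1V6akRVVnc">
        ECZ Civic Education Paper 2 2009.</a></p>
"""

# A repeated name is harmless once the list becomes a set.
filenames = ['ECZ Mathematics Paper 2 2019.', 'ECZ Mathematics Paper 2 2019.',
             'ECZ Science Paper 3 2009.']
wanted = set(filenames)

soup = BeautifulSoup(html, 'html.parser')

# get_text(strip=True) drops the leading newline and indentation inside each <a>.
file_links = [a['href'] for a in soup.find_all('a', href=True)
              if a.get_text(strip=True) in wanted]

for href in file_links:
    print(href)
```

Each link is visited once, so every matching URL appears once regardless of how often its name occurs in the list.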

Upvotes: 1

Pritish kumar

Reputation: 512

If the request succeeds, just parse the response with bs and collect the link tags ("a") using findAll. I think there is no need to pass string = filenames[0] to findAll.

from bs4 import BeautifulSoup as bs
temp = """<p><a href="https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi">
                            ECZ Mathematics Paper 2 2019.</a></p>

<p><a href="https://drive.google.com/uc?export=download&id=1x_9E3PaviCuSsqfJqOsQKOwVlCWZ1jqf">
                            ECZ Mathematics Paper 1 2019.</a></p>

<p><a href="https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp">
                            ECZ Science Paper 3 2009.</a></p>

<p><a href="https://drive.google.com/uc?export=download&id=0B0lFc6TrfIg7aENYc1V6akRVVnc">
                            ECZ Civic Education Paper 2 2009.</a></p>"""

soup = bs(temp, 'html5lib')
links = soup.findAll('a')
# Keep every link whose href contains "export"; the link text is not checked here.
file_links = [link['href'] for link in links if "export" in link['href']]

Output:

['https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi',
 'https://drive.google.com/uc?export=download&id=1x_9E3PaviCuSsqfJqOsQKOwVlCWZ1jqf',
 'https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp',
 'https://drive.google.com/uc?export=download&id=0B0lFc6TrfIg7aENYc1V6akRVVnc']
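
The list above still contains the fourth (Civic Education) link, since nothing filters on the link text. One likely reason `string = filenames[0]` matched nothing in the question is that the text inside each `<a>` starts with a newline and indentation, so it never equals the bare name exactly. A hedged sketch that keeps the `string` argument but passes a function that strips whitespace before comparing; `is_wanted` is a hypothetical helper name, the HTML is a shortened copy of the snippet, and `html.parser` is used only to avoid the extra `html5lib` dependency:

```python
from bs4 import BeautifulSoup

temp = """<p><a href="https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi">
        ECZ Mathematics Paper 2 2019.</a></p>
<p><a href="https://drive.google.com/uc?export=download&id=0B0lFc6TrfIg7aENYc1V6akRVVnc">
        ECZ Civic Education Paper 2 2009.</a></p>"""

filenames = ['ECZ Mathematics Paper 1 2019.',
             'ECZ Mathematics Paper 2 2019.',
             'ECZ Science Paper 3 2009.']


def is_wanted(text):
    # text is the tag's .string; guard against None for tags without a single string.
    return text is not None and text.strip() in filenames


soup = BeautifulSoup(temp, 'html.parser')
links = soup.find_all('a', string=is_wanted)
file_links = [link['href'] for link in links if "export" in link['href']]
print(file_links)
```

With this filter only the Mathematics Paper 2 link from the shortened snippet is kept, and the Civic Education link is skipped.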

Upvotes: 0
