ss_0708

Reputation: 193

Scraping URLs in a webpage using BeautifulSoup

Below is the code to scrape this webpage. Out of all the URLs on the page, I need only those that lead to further information about the job postings, for example the links to company pages such as "Abbot", "Abbvie", "Affymetrix", and so on.

import requests
import pandas as pd
import re
from lxml import html
from bs4 import BeautifulSoup
from selenium import webdriver
list = ['#medical-device','#engineering','#recruitment','#job','#linkedin']
page = "https://dpseng.com.sg/definitive-singapore-pharma-job-website-directory/"
list_of_pages = [page + x for x in list]
for info in list_of_pages:
    pages= requests.get(info)
    soup = BeautifulSoup(pages.content, 'html.parser')
    tags = [div.p for div in soup.find_all('div', attrs ={'class':'fusion-text'})]
    for m in tags:
        try:
            links = [link['href'] for link in tags]
        except KeyError:
            pass
        print(links)

The output I am getting is a series of blank lists, like below:

[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]

What should I add/edit in the above code to scrape the URLs and the further information at those URLs?

Thanks!

Upvotes: 1

Views: 64

Answers (2)

Joseph Woolf

Reputation: 550

What I noticed is that the anchored URLs don't isolate the HTML you want: the fragment (the part after #) is never sent to the server, so every request returns the same full page, and you end up grabbing all instances of <div class='fusion-text'>.
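You can see this directly: requests strips the fragment before sending the request, so every anchored URL maps to the same request path. A quick illustrative check (any two anchors will do):

    import requests

    base = "https://dpseng.com.sg/definitive-singapore-pharma-job-website-directory/"
    for anchor in ["#medical-device", "#engineering"]:
        prepared = requests.Request("GET", base + anchor).prepare()
        # path_url is what actually goes on the wire; the fragment is gone
        print(prepared.path_url)

Both iterations print the identical path, which is why every request returns the same page.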

The following code example will retrieve all URLs that you want:

import requests
from bs4 import BeautifulSoup

# Fetch the page once; every #anchor points at the same document
page = "https://dpseng.com.sg/definitive-singapore-pharma-job-website-directory/"
soup = BeautifulSoup(requests.get(page).content, 'html.parser')

# Each directory section is a <div> whose id matches the anchor name
for section in ['medical-device', 'engineering', 'recruitment', 'job', 'linkedin']:
    subsection = soup.find('div', attrs={'id': section})
    # href=True skips any <a> tag that has no href attribute
    links = [a['href'] for a in subsection.find_all('a', href=True)]
    print("{}: {}".format(section, links))

Upvotes: 1

user5386938

Perhaps try something like

import requests
from bs4 import BeautifulSoup

# Renamed from `list` so the built-in list() is not shadowed
anchors = ['#medical-device', '#engineering', '#recruitment', '#job', '#linkedin']
page = "https://dpseng.com.sg/definitive-singapore-pharma-job-website-directory/"
list_of_pages = [page + x for x in anchors]

for info in list_of_pages:
    pages = requests.get(info)
    soup = BeautifulSoup(pages.content, 'html.parser')
    # div.p is the first <p> inside each div and may be None
    tags = [div.p for div in soup.find_all('div', attrs={'class': 'fusion-text'})]

    links = []
    for p in tags:
        if p is None:
            continue
        links.extend([a['href'] for a in p.find_all('a', href=True)])

    print(links)
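As noted in the other answer, the fragment never reaches the server, so all five iterations fetch the same page and the same list prints five times. If the duplicates are unwanted, one option (a sketch reusing the variables defined above) is to accumulate across iterations and deduplicate at the end:

    all_links = []
    for info in list_of_pages:
        soup = BeautifulSoup(requests.get(info).content, 'html.parser')
        for div in soup.find_all('div', attrs={'class': 'fusion-text'}):
            for a in div.find_all('a', href=True):
                all_links.append(a['href'])

    # dict.fromkeys keeps insertion order while dropping repeats
    unique_links = list(dict.fromkeys(all_links))
    print(unique_links)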

Upvotes: 1
