ss_0708

Reputation: 193

Scraping URLs in a webpage using BeautifulSoup

Below is the code to scrape this webpage. Out of all the URLs on the page, I need only those that lead to further information about the job postings, for example the links to company pages such as "Abbot", "Abbvie", "Affymetrix", and so on.

import requests
import pandas as pd
import re
from lxml import html
from bs4 import BeautifulSoup
from selenium import webdriver
list = ['#medical-device','#engineering','#recruitment','#job','#linkedin']
page = "https://dpseng.com.sg/definitive-singapore-pharma-job-website-directory/"
list_of_pages = [page + x for x in list]
for info in list_of_pages:
    pages= requests.get(info)
    soup = BeautifulSoup(pages.content, 'html.parser')
    tags = [div.p for div in soup.find_all('div', attrs ={'class':'fusion-text'})]
    for m in tags:
        try:
            links = [link['href'] for link in tags]
        except KeyError:
            pass
        print(links)

The output I am getting is a series of blank lists, like below:

[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]

What should I add/edit in the above code to scrape the URLs and the further information at those URLs?

Thanks!

Upvotes: 1

Views: 64

Answers (2)

Joseph Woolf

Reputation: 550

What I noticed is that the anchored URLs don't isolate the HTML you want: the fragment (the part after #) is never sent to the server, so every request returns the same full page, and you end up grabbing all instances of <div class='fusion-text'>.
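You can see this directly: requests strips the fragment before sending the request, so every anchored URL maps to the same request path. A quick illustrative check (any two anchors will do):

    import requests

    base = "https://dpseng.com.sg/definitive-singapore-pharma-job-website-directory/"
    for anchor in ["#medical-device", "#engineering"]:
        prepared = requests.Request("GET", base + anchor).prepare()
        # path_url is what actually goes on the wire; the fragment is gone
        print(prepared.path_url)

Both iterations print the identical path, which is why every request returns the same page.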

The following code example will retrieve all URLs that you want:

import requests
from bs4 import BeautifulSoup

# Fetch the page once; every #anchor points at the same document
page = "https://dpseng.com.sg/definitive-singapore-pharma-job-website-directory/"
soup = BeautifulSoup(requests.get(page).content, 'html.parser')

# Each directory section is a <div> whose id matches the anchor name
for section in ['medical-device', 'engineering', 'recruitment', 'job', 'linkedin']:
    subsection = soup.find('div', attrs={'id': section})
    # href=True skips any <a> tag that has no href attribute
    links = [a['href'] for a in subsection.find_all('a', href=True)]
    print("{}: {}".format(section, links))

Upvotes: 1

user5386938

Perhaps try something like

import requests
from bs4 import BeautifulSoup

# Renamed from `list` so the built-in list() is not shadowed
anchors = ['#medical-device', '#engineering', '#recruitment', '#job', '#linkedin']
page = "https://dpseng.com.sg/definitive-singapore-pharma-job-website-directory/"
list_of_pages = [page + x for x in anchors]

for info in list_of_pages:
    pages = requests.get(info)
    soup = BeautifulSoup(pages.content, 'html.parser')
    # div.p is the first <p> inside each div and may be None
    tags = [div.p for div in soup.find_all('div', attrs={'class': 'fusion-text'})]

    links = []
    for p in tags:
        if p is None:
            continue
        links.extend([a['href'] for a in p.find_all('a', href=True)])

    print(links)
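As noted in the other answer, the fragment never reaches the server, so all five iterations fetch the same page and the same list prints five times. If the duplicates are unwanted, one option (a sketch reusing the variables defined above) is to accumulate across iterations and deduplicate at the end:

    all_links = []
    for info in list_of_pages:
        soup = BeautifulSoup(requests.get(info).content, 'html.parser')
        for div in soup.find_all('div', attrs={'class': 'fusion-text'}):
            for a in div.find_all('a', href=True):
                all_links.append(a['href'])

    # dict.fromkeys keeps insertion order while dropping repeats
    unique_links = list(dict.fromkeys(all_links))
    print(unique_links)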

Upvotes: 1
