Reputation: 193
Below is the code to scrape this webpage. Out of all the URLs on the page, I need only those that lead to further information about the job postings, for example the URLs for company names like "Abbott", "Abbvie", "Affymetrix", and so on.
import requests
import pandas as pd
import re
from lxml import html
from bs4 import BeautifulSoup
from selenium import webdriver
list = ['#medical-device','#engineering','#recruitment','#job','#linkedin']
page = "https://dpseng.com.sg/definitive-singapore-pharma-job-website-directory/"
list_of_pages = [page + x for x in list]
for info in list_of_pages:
    pages = requests.get(info)
    soup = BeautifulSoup(pages.content, 'html.parser')
    tags = [div.p for div in soup.find_all('div', attrs={'class': 'fusion-text'})]
    for m in tags:
        try:
            links = [link['href'] for link in tags]
        except KeyError:
            pass
        print(links)
The output I am getting is a series of blank lists like below:
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
What should I add/edit in the above code to scrape the URLs and the further information in those URLs?
Thanks!
Upvotes: 1
Views: 64
Reputation: 550
What I noticed is that the anchor fragments (the '#...' parts) don't change the HTML the server returns, so each request fetches the same full page and doesn't isolate the section you actually want. As a result, you're grabbing every instance of <div class='fusion-text'>.
The following code example will retrieve all URLs that you want:
import requests
from bs4 import BeautifulSoup
# Get webpage
page = "https://dpseng.com.sg/definitive-singapore-pharma-job-website-directory/"
soup = BeautifulSoup(requests.get(page).content, 'html.parser')
# Grab all URLs under each section
for section in ['medical-device','engineering','recruitment','job','linkedin']:
    subsection = soup.find('div', attrs={'id': section})
    links = [a['href'] for a in subsection.find_all('a')]
    print("{}: {}".format(section, links))
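The question also asks about scraping the further information inside those URLs. As a minimal follow-up sketch (not part of the original answer), each scraped link can be requested in turn; pulling the page <title> is only an illustrative assumption about what "further information" might mean, and the startswith('http') check assumes the relevant links are absolute URLs:
import requests
from bs4 import BeautifulSoup

# Assumption: 'links' is the list of hrefs collected for one section above.
for url in links:
    if not url.startswith('http'):
        continue  # skip in-page anchors and relative links
    try:
        detail = BeautifulSoup(requests.get(url, timeout=10).content, 'html.parser')
        # The page <title> stands in here for whatever detail you actually need
        title = detail.title.get_text(strip=True) if detail.title else ''
        print("{} -> {}".format(url, title))
    except requests.RequestException:
        pass  # skip links that fail to load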
Upvotes: 1
Reputation:
Perhaps try something like
import requests
from bs4 import BeautifulSoup
list = ['#medical-device','#engineering','#recruitment','#job','#linkedin']
page = "https://dpseng.com.sg/definitive-singapore-pharma-job-website-directory/"
list_of_pages = [page + x for x in list]
for info in list_of_pages:
    pages = requests.get(info)
    soup = BeautifulSoup(pages.content, 'html.parser')
    tags = [div.p for div in soup.find_all('div', attrs={'class': 'fusion-text'})]
    links = []
    for p in tags:
        links.extend([a['href'] for a in p.find_all('a')])
    print(links)
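Since the goal is only the company/job-posting URLs (the Abbott, Abbvie, etc. entries) rather than every href on the page, a small hedged addition could filter and de-duplicate the collected links before printing. This sketch would go where print(links) currently is; the startswith('http') test is an assumption about how those links look on that page:
# Keep only absolute http(s) URLs and drop duplicates, preserving order
seen = set()
company_links = []
for href in links:
    if href.startswith('http') and href not in seen:
        seen.add(href)
        company_links.append(href)
print(company_links)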
Upvotes: 1