Abdelrahmane Khaldi

Reputation: 97

BeautifulSoup: href link is undefined

I want to scrape a website, but the href of every <a> tag I reach is "job/undefined". I use a POST request to fetch the data from the page.

The POST request and its form data are in this code:

from bs4 import BeautifulSoup
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}

postData = {
  'search': 'search',
  'facets[camp_type]':'day_camp',
  'open[choices-made-content]': 'true'}

url = 'https://www.trustme.work/en'
html_1 = requests.post(url, headers=headers, data=postData)

soup1 = BeautifulSoup(html_1.text, 'lxml')
# Chain classes with dots; a space would be read as a descendant selector
a = soup1.select('div.MuiGrid-root.MuiGrid-grid-xs-12')
b = soup1.select('span[class="MuiTypography-root MuiTypography-h2"]')
print('soup:', b)

Sample from the output:

<span class="MuiTypography-root MuiTypography-h2" style="cursor:pointer">
    <a href="job/undefined" style="color:#413E52;text-decoration:none">
    Network and Security engineer
    </a>
</span>

Upvotes: 1

Views: 60

Answers (1)

HedgeHog

Reputation: 25241

EDIT

Part of the content is served dynamically, so you have to fetch each job's hashid via the API and then build the link yourself, or use the data from the JSON response directly:

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}
url = 'https://api.trustme.work/api/job_offers?include=technologies%2Cjob%2Ccompany%2Ccontract_type%2Clevel'
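# Fetch the job offers as JSON; the job records sit under included -> jobs
jobs = requests.get(url, headers=headers).json()['included']['jobs']

# Build one link per job from its hashid
links = ['https://www.trustme.work/job/' + job['hashid'] for job in jobs.values()]
print(links)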
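
The link-building step can be tried without calling the API. The following is a minimal sketch, assuming the response has the shape used above (a dict under included -> jobs whose values carry a hashid); the helper name build_job_links and the sample payload are hypothetical and only there for illustration:

```python
def build_job_links(payload, base="https://www.trustme.work/job/"):
    """Return one job URL per record that carries a hashid.

    Assumes payload["included"]["jobs"] is a dict of job records,
    as in the snippet above; missing keys yield an empty list.
    """
    jobs = payload.get("included", {}).get("jobs", {})
    return [base + job["hashid"] for job in jobs.values() if job.get("hashid")]


# Hypothetical sample payload, not real API output
sample = {
    "included": {
        "jobs": {
            "1": {"hashid": "abc123"},
            "2": {"hashid": "def456"},
            "3": {},  # record without a hashid is skipped
        }
    }
}

links = build_job_links(sample)
print(links)
```

Skipping records without a hashid avoids ending up with links like "job/undefined" again if a record happens to be incomplete.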

To get the link from each job post, make your CSS selector more specific. Also prefer static identifiers or the HTML structure over class names:

.select('h2 a')
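
To see how the selectors behave, here is a small sketch on an inline HTML snippet. The markup is a hypothetical stand-in modelled on the sample output in the question (it assumes the span sits inside an h2), and it uses the built-in "html.parser" so that lxml is not required:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the question's sample output
html = """
<div class="MuiGrid-root MuiGrid-grid-xs-12">
  <h2><span class="MuiTypography-root MuiTypography-h2">
    <a href="job/undefined">Network and Security engineer</a>
  </span></h2>
  <p><a href="/about">About</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Structural selector: only anchors nested inside an <h2>
by_structure = soup.select("h2 a")

# Class selector: several classes on one element are chained with dots
by_class = soup.select("div.MuiGrid-root.MuiGrid-grid-xs-12")

# With a space, the second part is read as a descendant *tag* name,
# so nothing matches
broken = soup.select("div.MuiGrid-root MuiGrid-grid-xs-12")

print([a["href"] for a in by_structure], len(by_class), len(broken))
```

The structural selector keeps working even if the generated class names change, which is the reason for preferring it here.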

To get a list of all links, use a list comprehension (the hrefs are relative, like "job/...", so the base URL needs a trailing slash):

['https://www.trustme.work/' + a.get('href') for a in soup1.select('h2 a')]
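
Since the href values in the question's sample output are relative ("job/..."), joining them onto the base URL by hand is easy to get wrong. As an alternative sketch, urljoin from the standard library handles hrefs with and without a leading slash; the helper name absolute_links and the sample hrefs are hypothetical:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.trustme.work/"


def absolute_links(hrefs, base=BASE_URL):
    """Resolve each href against the base URL, skipping empty values."""
    return [urljoin(base, href) for href in hrefs if href]


# Hypothetical hrefs: relative, root-relative, and a missing one
links = absolute_links(["job/abc123", "/job/def456", None])
print(links)

# Plain concatenation without a separating slash glues the parts together
naive = "https://www.trustme.work" + "job/abc123"
print(naive)
```

In the list comprehension, this would mean passing a.get('href') for each selected anchor to urljoin instead of using +.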

Example

from bs4 import BeautifulSoup
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}

postData = {
 'search': 'search',
 'facets[camp_type]':'day_camp',
 'open[choices-made-content]': 'true'}

url = 'https://www.trustme.work/en'
html_1 = requests.post(url, headers=headers, data=postData)

soup1 = BeautifulSoup(html_1.text, 'lxml')
# The hrefs are relative ("job/..."), so the base URL needs a trailing slash
links = ['https://www.trustme.work/' + a.get('href') for a in soup1.select('h2 a')]
print(links)

Upvotes: 3
