reesh19

Reputation: 75

beautifulsoup returns None for any element I try

I'm building a fully automated get-a-job application. Funnily enough, the automation portion is fairly simple; the scraping, not so much.

In short, requests + beautifulsoup has worked for the majority of domains I am scraping. However, nothing works when I try the same process on Workable pages:

import requests
from bs4 import BeautifulSoup as bs

session = requests.Session()
url = 'https://apply.workable.com/breederdao-1/j/602097ACC9/'
req = session.get(url)
soup = bs(req.text, 'html.parser')  # parse the response body

title = soup.find('h1', {'data-ui': 'job-title'})
print(title)

>>> None

details = soup.find('span', {'data-ui': 'job-location'})
print(details)

>>> None

Both elements are under body. However, when I fetch the page's title, I do get what I expect:

title_0 = soup.find('title')
print(title_0)

>>> <title>Data Analyst (Fully Remote) - BreederDAO</title>

I tried using await + HTMLSession / AsyncHTMLSession as well, but as long as the element is inside body, every find() still returns None.

Can anyone educate me on this? My current hypothesis is that the website has some kind of anti-scraping mechanism, but I have zero idea where to even start looking. This element does look extra sus though:

<html...
  <head>...</head>
  <body>
    .
    .
    .
    <noscript>
      <iframe height="0" width="0" src="https://www.googletagmanager.com/ns.html?id=GTM-WKS7WTT&amp;gtm_auth=SGnzIn3pcB7S4fevFXOKPQ&amp;gtm_preview=env-2&amp;gtm_cookies_win=x" style="display: none; visibility: hidden;">
        #document
          <!DOCTYPE html>
          <html lang="en">
            <head>
              <meta charset="utf-8">
              <title>ns</title>
            </head>
            <body>
              " "
            </body>
          </html>
      </iframe>
    </noscript>
    .
    .
    .
  </body>
</html>

Upvotes: 2

Views: 39

Answers (1)

Andrej Kesely

Reputation: 195573

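The data you see on the page is loaded from an external URL via JavaScript, so it is not in the HTML that requests receives. You can still get it with the requests module by calling that URL directly. For example: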

import json
import requests


# 602097ACC9 is from your URL
url = "https://apply.workable.com/api/v2/accounts/breederdao-1/jobs/602097ACC9"
data = requests.get(url).json()

# uncomment to print all data:
# print(json.dumps(data, indent=4))

print(data["title"])
print(", ".join(data["location"].values()))

Prints:

Data Analyst (Fully Remote)
Philippines, PH, Makati, Metro Manila
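
If you need this for many job pages, the endpoint can probably be derived from the page URL instead of being typed by hand. A minimal sketch, assuming every job URL follows the same /<account>/j/<shortcode>/ layout as yours and that the location values are either strings or empty; workable_api_url and format_location are hypothetical helper names, not part of any library:

```python
from urllib.parse import urlsplit

# Endpoint pattern taken from the snippet above.
API_TEMPLATE = "https://apply.workable.com/api/v2/accounts/{account}/jobs/{shortcode}"


def workable_api_url(page_url):
    """Map a job page URL (/<account>/j/<shortcode>/) to the JSON endpoint."""
    parts = [p for p in urlsplit(page_url).path.split("/") if p]
    # Assumed layout: ['<account>', 'j', '<shortcode>']
    if len(parts) < 3 or parts[1] != "j":
        raise ValueError(f"unexpected job URL layout: {page_url}")
    return API_TEMPLATE.format(account=parts[0], shortcode=parts[2])


def format_location(location):
    """Join the location fields, skipping empty or non-string values."""
    return ", ".join(v for v in location.values() if isinstance(v, str) and v)
```

You would then call requests.get(workable_api_url(url)).json() as before and pass data["location"] to format_location, which avoids a TypeError from join() if one of the location fields comes back empty.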

Upvotes: 1
