Pabluez

Reputation: 2775

Can't scrape a specific website using Python requests

I'm trying to scrape the URL below, but the response doesn't contain the content I see when I open the page in a browser (the content of a public customer case/story). I also tried simulating a real browser with headers, but no luck so far. Any tips?

URL: https://customers.microsoft.com/en-us/story/767633-asos-retailer-azure-active-directory-m365

import requests
main_url = "https://customers.microsoft.com/en-us/story/767633-asos-retailer-azure-active-directory-m365"
result = requests.get(main_url)
print(result.text)

Upvotes: 0

Views: 416

Answers (2)

Bertrand Martel

Reputation: 45352

The page loads its data from a separate API, so the story text is not in the initial HTML. You just need to call:

GET https://customers.microsoft.com/en-us/api/search?key=STORY_KEY

STORY_KEY is 767633-asos-retailer-azure-active-directory-m365, i.e. the text after the last slash in the URL. You could use a script like the following:

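import requests

url = "https://customers.microsoft.com/en-us/story/767633-asos-retailer-azure-active-directory-m365"

# The story key is the last path segment of the story URL.
r = requests.get(
    "https://customers.microsoft.com/en-us/api/search",
    params={"key": url.rsplit("/", 1)[1]},
)
r.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
document = r.json()["search_document"]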

summary = document["story_exec_summary"]
body = document["story_body_text_2"]
quote1 = document["story_quote_carousel"]
quote2 = document["story_quote_carousel_2"]

print(summary)
print(body)
print(quote1)
print(quote2)

Note that you will need to look through the document object to find the other data you want (videos, body3, etc.).
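To make that search through the document easier, here is a minimal sketch of two helpers: one that pulls the story key out of a URL (also tolerating a trailing slash or query string), and one that lists which fields actually hold content. The function names `story_key_from_url` and `non_empty_fields` and the sample dictionary are made up for illustration; the real response may contain different or nested fields.

```python
from urllib.parse import urlsplit


def story_key_from_url(story_url):
    """Return the last path segment of a story URL (the API 'key' parameter)."""
    path = urlsplit(story_url).path
    return path.rstrip("/").rsplit("/", 1)[-1]


def non_empty_fields(document):
    """Map each field name to a short preview of its value, skipping empty
    values, so it is easy to see which fields are worth reading."""
    previews = {}
    for name, value in document.items():
        if value in (None, "", [], {}):
            continue
        previews[name] = str(value)[:80]
    return previews


# Hypothetical stand-in for r.json()["search_document"]; not real API output.
sample_document = {
    "story_exec_summary": "placeholder summary text",
    "story_body_text_2": "",
    "story_quote_carousel": None,
}

for field, preview in non_empty_fields(sample_document).items():
    print(field, "->", preview)
```

In the script above you would pass `story_key_from_url(url)` as the `key` parameter and call `non_empty_fields(document)` on the real document to decide which fields to print.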

Upvotes: 1

Hussain Bohra

Reputation: 1005

You need to handle TLS certificates properly, which requires two additional packages:

pip install certifi
pip install urllib3

We also need to use a different Python library, urllib3:

python
Python 3.7.7 (default, Mar 10 2020, 15:43:33)
[Clang 11.0.0 (clang-1100.0.33.17)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> import certifi
>>> import urllib3
>>>
>>> http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
>>> main_url = "https://customers.microsoft.com/en-us/story/767633-asos-retailer-azure-active-directory-m365"
>>>
>>> r = http.request('GET', main_url)
>>> r.status
200
>>> r.data

>>> open("stories.html", "wb").write(r.data)

Output:

>>> r.data
b'\r\n<!doctype html>\r\n<html lang="en" xml:lang="en" dir="ltr">\r\n<head prefix="og: http://ogp.me/ns#">\r\n    <meta charset="utf-8" />\r\n    <meta name="viewport" content="width=device-width, initial-scale=1.0" />\r\n    <meta name="description" content="Microsoft customer stories. See how Microsoft tools help companies run their business.">\r\n    <meta name="keywords" content="Microsoft, customers, stories, business, software, tools, services, use case, global, collaboration, vendor, story sear .....

Let me know if this helps.

Upvotes: 0
