Reputation: 65
I am trying to extract the article body, including its images, from this link so that I can build an HTML table from the extracted content. So far I have tried using BeautifulSoup.
import re
import requests
from bs4 import BeautifulSoup

t_link = 'https://www.cnbc.com/2022/01/03/5-ways-to-reset-your-retirement-savings-and-save-more-money-in-2022.html'
page = requests.get(t_link)
soup_page = BeautifulSoup(page.content, 'html.parser')
# Find the article body container by the start of its class name
html_article = soup_page.find_all("div", {"class": re.compile('ArticleBody-articleBody.?')})
for article_body in html_article:
    print(article_body)
Unfortunately, the printed article_body does not contain any images, because the <div class="InlineImage-wrapper"> elements are not scraped this way.
So, how can I get the article data together with the article images, so that I can build an HTML table?
Upvotes: 0
Views: 542
Reputation: 5688
I didn't quite understand your goal, so this may not be the answer you want.
In the HTML source of that page, everything you need is inside the script tag at the bottom, which holds the page content in JSON format.
If you simply use grep and jq (a great JSON CLI utility), you can run
curl -kL "https://www.cnbc.com/2022/01/03/5-ways-to-reset-your-retirement-savings-and-save-more-money-in-2022.html" | \
grep -Po '"body":.+"body".' | \
grep -Po '{"content":\[.+"body".' | \
jq '[.content[]|select(.tagName|contains("image"))]'
to get all the information about the images:
[
  {
    "tagName": "image",
    "attributes": {
      "id": "106967852",
      "type": "image",
      "creatorOverwrite": "PM Images",
      "headline": "Retirement Savings",
      "url": "https://image.cnbcfm.com/api/v1/image/106967852-1635524865061-GettyImages-1072593728.jpg?v=1635525026",
      "datePublished": "2021-10-29T16:30:26+0000",
      "copyrightHolder": "PM Images",
      "width": "2233",
      "height": "1343"
    },
    "data": {
      "__typename": "image"
    },
    "children": [],
    "__typename": "bodyContent"
  },
  {
    "tagName": "image",
    "attributes": {
      "id": "106323101",
      "type": "image",
      "creatorOverwrite": "JGI/Jamie Grill",
      "headline": "GP: 401k money jar on desk of businesswoman",
      "url": "https://image.cnbcfm.com/api/v1/image/106323101-1578344280328gettyimages-672157227.jpeg?v=1641216437",
      "datePublished": "2020-01-06T20:58:19+0000",
      "copyrightHolder": "JGI/Jamie Grill",
      "width": "5120",
      "height": "3418"
    },
    "data": {
      "__typename": "image"
    },
    "children": [],
    "__typename": "bodyContent"
  }
]
If you need only the URLs, run
curl -kL "https://www.cnbc.com/2022/01/03/5-ways-to-reset-your-retirement-savings-and-save-more-money-in-2022.html" | \
grep -Po '"body":.+"body".' | \
grep -Po '{"content":\[.+"body".' | \
jq -r '[.content[]|select(.tagName|contains("image"))]|.[].attributes.url'
to get
https://image.cnbcfm.com/api/v1/image/106967852-1635524865061-GettyImages-1072593728.jpg?v=1635525026
https://image.cnbcfm.com/api/v1/image/106323101-1578344280328gettyimages-672157227.jpeg?v=1641216437
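For anyone who prefers to stay in Python, the jq filter above can be approximated as follows. This is only a minimal sketch, assuming the "body" object has already been parsed into a dict with the shape shown in the output above. `select_images`, `image_urls` and `sample_body` are illustrative names, not part of any library, and the "p" entry in the sample is a made-up placeholder for a non-image item.

```python
def select_images(body: dict) -> list:
    """Return the content items whose tagName contains "image"."""
    return [
        item
        for item in body.get("content", [])
        if "image" in item.get("tagName", "")
    ]


def image_urls(body: dict) -> list:
    """Return only the URL of each image item, like the second jq call."""
    return [item["attributes"]["url"] for item in select_images(body)]


# Sample data trimmed from the output above; the "p" entry is a
# hypothetical stand-in for any non-image item in the body.
sample_body = {
    "content": [
        {"tagName": "p", "attributes": {}, "children": []},
        {
            "tagName": "image",
            "attributes": {
                "id": "106967852",
                "url": "https://image.cnbcfm.com/api/v1/image/106967852-1635524865061-GettyImages-1072593728.jpg?v=1635525026",
            },
            "children": [],
        },
    ]
}

for image_url in image_urls(sample_body):
    print(image_url)
```

The substring test mirrors jq's `contains` on strings, so a tag name such as "inlineimage" would also match. Use `==` instead if only exact "image" items are wanted.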
Upvotes: 1
Reputation: 20042
Everything you want is in the source HTML, but you need to jump through a couple of hoops to get that data.
Here's how to get the article body text and the URLs of its assets:
import json
import re

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:104.0) Gecko/20100101 Firefox/104.0",
}

with requests.Session() as s:
    s.headers.update(headers)
    url = "https://www.cnbc.com/2022/01/03/5-ways-to-reset-your-retirement-savings-and-save-more-money-in-2022.html"
    # Pick the <script> tag that holds the page state (window.__s_data).
    script = [
        tag.text for tag in
        BeautifulSoup(s.get(url).text, "lxml").find_all("script")
        if "window.__s_data" in tag.text
    ][0]
    # Cut out the JSON assigned to window.__s_data and parse it.
    payload = json.loads(
        re.match(r"window\.__s_data=(.*);\swindow\.__c_data=", script).group(1)
    )
    # The article module sits at a fixed position in this page's layout.
    article_data = (
        payload
        ["page"]
        ["page"]
        ["layout"][3]
        ["columns"][0]
        ["modules"][2]
        ["data"]
    )
    print(article_data["articleBodyText"])
    # Print the URL of every asset (video, images) found in the body.
    for item in article_data["body"]["content"]:
        if "url" in item["attributes"]:
            print(item["attributes"]["url"])
This should print:
The new year offers opportunities for many Americans in their careers and financial lives. The "Great Reshuffle" is expected to continue as employees leave jobs and take new ones at a rapid clip. At the same time, many workers have made a vow to save more this year, yet many admit they don't know how they'll stick to that goal. One piece of advice: Keep it simple.
[...]
And the URLs of the article's assets:
https://www.cnbc.com/video/2022/01/03/how-to-choose-the-best-retirement-strategy-for-2022.html
https://image.cnbcfm.com/api/v1/image/106967852-1635524865061-GettyImages-1072593728.jpg?v=1635525026
https://image.cnbcfm.com/api/v1/image/106323101-1578344280328gettyimages-672157227.jpeg?v=1641216437
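The fixed indices in the lookup above (`["layout"][3]`, `["modules"][2]`) depend on this page's layout and may differ for other articles. A more tolerant alternative, sketched here as an assumption rather than anything the site documents, is to walk the parsed payload and return the first dict that carries an `articleBodyText` key. `find_article_data` is a hypothetical helper, and the sample payload is a made-up miniature that only mimics the nesting.

```python
from typing import Any, Optional


def find_article_data(node: Any, key: str = "articleBodyText") -> Optional[dict]:
    """Depth-first search for the first dict that contains `key`."""
    if isinstance(node, dict):
        if key in node:
            return node
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None  # scalars (str, int, None, ...) cannot contain the key
    for child in children:
        found = find_article_data(child, key)
        if found is not None:
            return found
    return None


# Made-up miniature payload; the real one is far larger.
sample_payload = {
    "page": {
        "page": {
            "layout": [
                {"columns": [{"modules": [{"data": {"title": "nav"}}]}]},
                {
                    "columns": [
                        {
                            "modules": [
                                {"data": None},
                                {
                                    "data": {
                                        "articleBodyText": "The new year offers opportunities",
                                        "body": {"content": []},
                                    }
                                },
                            ]
                        }
                    ]
                },
            ]
        }
    }
}

article_data = find_article_data(sample_payload)
print(article_data["articleBodyText"])
```

With the real page, `find_article_data(payload)` would take the place of the chained index lookup. It returns `None` when no matching dict exists, so check the result before indexing into it.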
EDIT:
If you want to download the images, use this:
import json
import os
import re
from pathlib import Path
from shutil import copyfileobj

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:104.0) Gecko/20100101 Firefox/104.0",
}
url = "https://www.cnbc.com/2022/01/03/5-ways-to-reset-your-retirement-savings-and-save-more-money-in-2022.html"


def download_images(image_source: str, directory: str) -> None:
    """Download a JPEG image from a given source and save it to a given directory."""
    os.makedirs(directory, exist_ok=True)
    save_dir = Path(directory)
    # Only .jpg/.jpeg URLs are downloaded; this skips the video page URL.
    if re.match(r".*\.jpe?g", image_source):
        # File name is the last path segment without the query string.
        file_name = save_dir / image_source.split("/")[-1].split("?")[0]
        # Uses the module-level session `s` opened below.
        with s.get(image_source, stream=True) as img, open(file_name, "wb") as output:
            # Undo any gzip/deflate transfer encoding before writing to disk.
            img.raw.decode_content = True
            copyfileobj(img.raw, output)


with requests.Session() as s:
    s.headers.update(headers)
    # Pick the <script> tag that holds the page state (window.__s_data).
    script = [
        tag.text for tag in
        BeautifulSoup(s.get(url).text, "lxml").find_all("script")
        if "window.__s_data" in tag.text
    ][0]
    # Cut out the JSON assigned to window.__s_data and parse it.
    payload = json.loads(
        re.match(r"window\.__s_data=(.*);\swindow\.__c_data=", script).group(1)
    )
    article_data = (
        payload
        ["page"]
        ["page"]
        ["layout"][3]
        ["columns"][0]
        ["modules"][2]
        ["data"]
    )
    print(article_data["articleBodyText"])
    for item in article_data["body"]["content"]:
        if "url" in item["attributes"]:
            # Separate name so the article URL above is not overwritten.
            image_url = item["attributes"]["url"]
            print(image_url)
            download_images(image_url, "images")
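Since the question asks for an HTML table, here is one way the extracted data might be turned into one. It is a rough sketch under stated assumptions: the layout (one full-width row for the text, then one row per image) is my own choice, `build_article_table` is a hypothetical helper, and the "div" tag name on the sample video item is invented because the real tag name is not shown above.

```python
from html import escape


def build_article_table(body_text: str, content: list) -> str:
    """Render the article text and its images as a simple HTML table."""
    rows = [f'<tr><td colspan="2">{escape(body_text)}</td></tr>']
    for item in content:
        attributes = item.get("attributes") or {}
        # Only image items are rendered; other assets such as videos are skipped.
        if item.get("tagName") != "image" or "url" not in attributes:
            continue
        src = escape(attributes["url"])
        caption = escape(attributes.get("headline", ""))
        rows.append(
            f'<tr><td><img src="{src}" alt="{caption}"></td><td>{caption}</td></tr>'
        )
    return "<table>\n" + "\n".join(rows) + "\n</table>"


# Sample input trimmed from the outputs shown earlier in this thread.
sample_text = 'The "Great Reshuffle" is expected to continue.'
sample_content = [
    {
        "tagName": "div",  # invented tag name for the video item
        "attributes": {
            "url": "https://www.cnbc.com/video/2022/01/03/how-to-choose-the-best-retirement-strategy-for-2022.html"
        },
    },
    {
        "tagName": "image",
        "attributes": {
            "headline": "Retirement Savings",
            "url": "https://image.cnbcfm.com/api/v1/image/106967852-1635524865061-GettyImages-1072593728.jpg?v=1635525026",
        },
    },
]

table = build_article_table(sample_text, sample_content)
print(table)
```

With the real data, the call would be `build_article_table(article_data["articleBodyText"], article_data["body"]["content"])`. All text is passed through `html.escape` so quotes or angle brackets in the article cannot break the markup.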
Upvotes: 0