Reputation: 584
I am working on a project which requires me to view a webpage, but to use its HTML further I need to see the page in full, not as a bunch of lines mixed in with pictures. Is there a way to parse the CSS along with the HTML using BeautifulSoup?
Here is my code:
import requests
from bs4 import BeautifulSoup

def get_html(url, name):
    r = requests.get(url)
    r.encoding = 'utf8'
    return r.text

link = 'https://www.labirint.ru/books/255282/'

with open('labirint.html', 'w', encoding='utf-8') as file:
    file.write(get_html(link, '255282'))
WARNING: The page: https://www.labirint.ru/books/255282/ has a redirect to https://www.labirint.ru/books/733371/.
Upvotes: 2
Views: 3289
Reputation: 34
Here is the Python function that I use (it extracts CSS from external stylesheets, <style> tags and inline style attributes):
import urllib.parse
from typing import Optional
import requests
from bs4 import BeautifulSoup
def extract_css_from_webpage(
    url: str, request_kwargs: Optional[dict] = None, verbose: bool = False
) -> tuple[list[str], list[str], list[dict]]:
    """Extracts CSS from a webpage

    Args:
        url (str): Webpage URL
        request_kwargs (dict): These arguments are passed to requests.get()
            (when fetching the webpage HTML and external stylesheets)
        verbose (bool): Print diagnostic information

    Returns:
        tuple[list[str], list[str], list[dict]]: css_from_external_stylesheets,
            css_from_style_tags, inline_css
    """
    if not request_kwargs:
        request_kwargs = {
            "timeout": 10,
            "headers": {"User-Agent": "Definitely not an Automated Script"},
        }
    url_response = requests.get(url, **request_kwargs)
    if url_response.status_code != 200:
        raise requests.exceptions.HTTPError(
            f"received response [{url_response.status_code}] from [{url}]"
        )
    soup = BeautifulSoup(url_response.content, "html.parser")

    # CSS from external stylesheets referenced by <link rel="stylesheet"> tags
    css_from_external_stylesheets: list[str] = []
    for link in soup.find_all("link", rel="stylesheet"):
        css_url = urllib.parse.urljoin(url, link["href"])
        if verbose:
            print(f"downloading external CSS stylesheet {css_url}")
        css_content: str = requests.get(css_url, **request_kwargs).text
        css_from_external_stylesheets.append(css_content)

    # CSS embedded directly in <style> tags
    css_from_style_tags: list[str] = []
    for style_tag in soup.find_all("style"):
        # .get_text() always returns a string (.string is None for empty <style> tags)
        css_from_style_tags.append(style_tag.get_text())

    # CSS set via inline style="..." attributes
    inline_css: list[dict] = []
    for tag in soup.find_all(style=True):
        inline_css.append({"tag": str(tag), "css": tag["style"]})

    if verbose:
        print(
            f"""Extracted the following CSS from [{url}]:
    1. {len(css_from_external_stylesheets):,} external stylesheets (total {len("".join(css_from_external_stylesheets)):,} characters of text)
    2. {len(css_from_style_tags):,} style tags (total {len("".join(css_from_style_tags)):,} characters of text)
    3. {len(inline_css):,} tags with inline CSS (total {len("".join(x["css"] for x in inline_css)):,} characters of text)
"""
        )

    return css_from_external_stylesheets, css_from_style_tags, inline_css
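For example, you could call it like this (a minimal usage sketch, here using the question's URL; any URL works):

external_css, style_tag_css, inline_css = extract_css_from_webpage(
    "https://www.labirint.ru/books/255282/", verbose=True
)
# each returned value is a list, so you can inspect or save the CSS however you like
print(len(external_css), len(style_tag_css), len(inline_css))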
Upvotes: 0
Reputation: 61
If your goal is to truly parse the CSS:
BeautifulSoup will pull the entire page, and that includes the header, style and script tags, and the links to external CSS and JS files. I have used the method from the pythonCodeArticle before and retested it against the link you provided.
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin
# URL of the web page you want to extract
url = "ENTER YOUR LINK HERE"
# initialize a session & set User-Agent as a regular browser
session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
# get the HTML content
html = session.get(url).content
# parse HTML using beautiful soup
soup = bs(html, "html.parser")
print(soup)
By looking at the soup output (it is very long, so I will not paste it here), you can see that it is the complete page. Just make sure to paste in your specific link.
Now, if you want to parse the result to pick up all the CSS URLs, you can add this (I am still using parts of the code from the well-described pythonCodeArticle linked above):
# get the CSS files
css_files = []

for css in soup.find_all("link"):
    # only keep link tags that have an 'href' attribute
    if css.attrs.get("href"):
        css_url = urljoin(url, css.attrs.get("href"))
        css_files.append(css_url)

print(css_files)
The output css_files will be a list of the linked CSS file URLs. You can now visit those separately and see the styles that are being imported.
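As a rough sketch of that next step (it just reuses the session and css_files variables from above), you could download each stylesheet and collect its text:

# fetch each stylesheet so its rules can be inspected
all_css = []
for css_url in css_files:
    css_response = session.get(css_url)
    if css_response.status_code == 200:
        all_css.append(css_response.text)

print(f"fetched {len(all_css)} stylesheets")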
NOTE: this particular site has a mix of styles inline with the HTML (i.e. they did not always use CSS files to set the style properties; sometimes the styles are inside the HTML content itself).
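If you also want to pick up that inline CSS, one possible sketch (again reusing the soup object from above) is to grab every tag that carries a style attribute, plus the contents of any <style> tags:

# values of inline style="..." attributes
inline_styles = [tag["style"] for tag in soup.find_all(style=True)]

# CSS embedded directly in <style> tags
style_tag_css = [tag.get_text() for tag in soup.find_all("style")]

print(len(inline_styles), len(style_tag_css))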
This should get you started.
Upvotes: 2