Patrick Steinhoff

Reputation: 13

BeautifulSoup extracts no content in Python

I'm having problems parsing this parked IONOS domain, "epaviste-gratuit-paris.com", with BeautifulSoup.

import requests
from bs4 import BeautifulSoup

# headers_parked is a dict of request headers defined elsewhere in my script
def parked(domain):
    url = 'http://' + domain
    try:
        response = requests.get(url, headers=headers_parked)
        response.raise_for_status()  # raises on 4xx/5xx responses
        soup = BeautifulSoup(response.text, 'lxml')
        print(soup.prettify())
    except requests.RequestException:
        pass

I always get the same output when I print the soup:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
       "http://www.w3.org/TR/html4/loose.dtd">
<html>
 <head>
  <meta content="0;url=defaultsite" http-equiv="Refresh"/>
  <!-- FR -->
 </head>
 <body>
 </body>
</html>

I can't find any classes or content. This is the first time I've seen this problem. Does anybody have an idea?

I've tried different headers / User-Agents and the 'html.parser' option.

Upvotes: 1

Views: 69

Answers (3)

DRA

Reputation: 168

Try to find a sample request in the Network tab of your browser's DevTools. Copy the request as cURL (bash) and use a converter site to turn it into a Python requests call; that shows you exactly which headers the browser sends. Sometimes the server only responds when certain headers and cookies are set, because they signal that you are a web browser rather than a bot, and websites can run a firewall that flags automated activity. This always helps me. It depends on the website: sometimes I have to set the headers, sometimes I have to remove them to access the content. But you usually need at least the user-agent to be set, and sometimes the verify option of the requests module set to False. For example:

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36',
    'x-xsrf-token': 'eyJpdiI6IllLblZaMS9ydE5YUnNlT1VEMXJtZnc9PSIsInZhbHVlIjoiQXQ0bE80aEdaa2wrR1B3YjdNdDRDK0JZaFF4VmVUa2tkRWFrMTVURUhQenFodXd1dEJ6SUJnd1FzdEl0UXlnVW92amY5NndSRFh2Q2RBQVhENjJGN2o3dlRGRUFNY0MzRll4Wjc5UVRwbzRMK09FMjdCNHl5NWxVdmFxZU1kMmIiLCJtYWMiOiIwNjU2YTJlOGViNGUwZTQ4YjYzNWIzNDA0MmIxOWM3Y2ZmODdiYzVhMDllOTllMDk1YmU2ZjFhN2QwMGIwYThkIiwidGFnIjoiIn0=',
}
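
A minimal sketch of using such headers against the domain from the question; the timeout value and verify=False are illustrative assumptions, not requirements of this particular site:

import requests

# Illustrative request using the headers dict above; verify=False skips
# TLS certificate checks and will emit an InsecureRequestWarning.
response = requests.get(
    'http://epaviste-gratuit-paris.com',
    headers=headers,
    verify=False,
    timeout=10,
)
print(response.status_code)
print(response.text[:500])  # first 500 characters of the body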

Upvotes: 0

Jiu_Zou

Reputation: 571

Use Selenium to load the page in a real browser, then hand the rendered HTML to BeautifulSoup:

from selenium import webdriver
from selenium.webdriver.edge.service import Service
from selenium.webdriver import EdgeOptions
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as soup

url = r"https://birdeye.so/find-gems?chain=solana/"

# Path to the locally installed Edge WebDriver executable
service = Service(executable_path=r'C:\Users\10696\Desktop\access\zhihu\msedgedriver\msedgedriver.exe')

# Options that hide the usual automation fingerprints
edge_options = EdgeOptions()
edge_options.add_experimental_option('excludeSwitches', ['enable-automation'])
edge_options.add_experimental_option('useAutomationExtension', False)
edge_options.add_argument('lang=zh-CN,zh,zh-TW,en-US,en')
edge_options.add_argument("disable-blink-features=AutomationControlled")

driver = webdriver.Edge(options=edge_options, service=service)
driver.get(url)

# Read the rendered <body> and extract its inner HTML
pag = driver.find_element(By.TAG_NAME, "body")
pag = driver.execute_script("return arguments[0].innerHTML;", pag)

table = soup(pag, "lxml")
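
If the page builds its content with JavaScript, it can help to wait explicitly until the body is present before reading it. A minimal sketch using Selenium's WebDriverWait with the same driver as above; the 10-second timeout is an arbitrary choice:

from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 10 seconds until a <body> element exists in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "body"))
)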

Upvotes: 0

Andrej Kesely

Reputation: 195408

If you look at the HTML source, you see the <meta http-equiv="Refresh"> tag carrying the redirect URL ("0;url=defaultsite", the standard meta-refresh format).

You can extract this new URL and request it to get the page content:

import requests
from bs4 import BeautifulSoup

url = "http://epaviste-gratuit-paris.com"

response = requests.get(url)
response.raise_for_status()

soup = BeautifulSoup(response.content, "html.parser")

# The <meta> content is "0;url=defaultsite" -> take the part after the "="
url2 = url + "/" + soup.find("meta")["content"].split("=")[-1]
soup = BeautifulSoup(requests.get(url2).content, "html.parser")

print(soup.h1.text.strip())

Prints:

Ce domaine est déjà enregistré

(French for "This domain is already registered.")
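
One caveat with the split("=")[-1] approach: it breaks if the redirect target itself contains an "=" character. A minimal, hypothetical variant that splits on "url=" once and resolves the result with urllib instead:

from urllib.parse import urljoin

# meta_content mirrors the attribute value from the parked page
meta_content = "0;url=defaultsite"
target = meta_content.split("url=", 1)[-1].strip()
url2 = urljoin("http://epaviste-gratuit-paris.com/", target)
print(url2)  # http://epaviste-gratuit-paris.com/defaultsite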

Upvotes: 0
