Ahmadreza

Reputation: 384

Cannot get the body element of an HTML page when web scraping with Python

I would like to parse a website with Python's urllib library. I wrote this:

from bs4 import BeautifulSoup
from urllib.request import HTTPCookieProcessor, build_opener
from http.cookiejar import FileCookieJar


def makeSoup(url):
    jar = FileCookieJar("cookies")
    opener = build_opener(HTTPCookieProcessor(jar))
    html = opener.open(url).read()
    return BeautifulSoup(html, "lxml")


def articlePage(url):
    return makeSoup(url)


Links = "http://collegeprozheh.ir/%d9%85%d9%82%d8%a7%d9%84%d9%87-%d9%85%d8%af%d9%84-%d8%b1%d9%82%d8%a7%d8%a8%d8%aa%db%8c-%d8%af%d8%b1-%d8%b5%d9%86%d8%b9%d8%aa-%d9%be%d9%86%d9%84-%d9%87%d8%a7%db%8c-%d8%ae%d9%88%d8%b1%d8%b4%db%8c%d8%af/"
print(articlePage(Links))

However, the website does not return the content of the body tag. This is the output of my program:

cURL = window.location.href;
var p = new Date();
second = p.getTime();
GetVars = getUrlVars();

setCookie("Human" , "15421469358743" , 10);
check_coockie = getCookie("Human");

if (check_coockie != "15421469358743")
        document.write("Could not Set cookie!");
else
        window.location.reload(true);


</script>
</head><body></body>
</html>

I think the cookie is causing this problem.

Upvotes: 0

Views: 182

Answers (1)

DocZerø

Reputation: 8557

The page is using JavaScript to check the cookie and to generate the content. However, urllib does not process JavaScript and thus the page shows nothing.

You'll either need to use something like Selenium, which acts as a browser and executes JavaScript, or you'll need to set the cookie yourself before you request the page (from what I can see, that's all the JavaScript code does). You seem to be loading a file containing cookie definitions (using FileCookieJar), but you haven't included its content.
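As a rough sketch of the second option, assuming the challenge script always has the setCookie("Human", ...) shape shown in your output and that the server checks nothing else: fetch the page once, pull the cookie name and value out of the script, then request the page again with that cookie attached. The helper names extract_challenge_cookie and fetch_with_challenge_cookie are made up for illustration, and I haven't run this against the site.

```python
import re
from urllib.request import build_opener

# Matches the setCookie("name" , "value" , days) call seen in the challenge script.
# Assumption: the script always uses double quotes and this argument order.
SET_COOKIE_RE = re.compile(r'setCookie\(\s*"([^"]+)"\s*,\s*"([^"]+)"')


def extract_challenge_cookie(html):
    """Return (name, value) from the challenge script, or None if it is absent."""
    match = SET_COOKIE_RE.search(html)
    return match.groups() if match else None


def fetch_with_challenge_cookie(url):
    """Fetch url; if the challenge page comes back, retry once with the cookie set."""
    opener = build_opener()
    html = opener.open(url).read().decode("utf-8", errors="replace")

    cookie = extract_challenge_cookie(html)
    if cookie is None:
        # No challenge script found, so assume this is already the real page.
        return html

    name, value = cookie
    # Send the cookie on the second request, as the script's reload would.
    opener.addheaders.append(("Cookie", f"{name}={value}"))
    return opener.open(url).read().decode("utf-8", errors="replace")
```

You could then pass the returned string to BeautifulSoup(html, "lxml") as in makeSoup. If the body still comes back empty, the site is probably checking more than this one cookie, and a real browser via Selenium is the safer route.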

Upvotes: 1
