Jansindl3r

Reputation: 399

Real page content isn't what I get with Requests and BeautifulSoup

As sometimes happens to me, I can't access everything with requests that I can see on the page in the browser, and I would like to know why. On these pages, I am particularly interested in the comments. Does anyone have an idea how to access those comments, please? Thanks!

import requests
from bs4 import BeautifulSoup

url = 'https://aukro.cz/uzivatel/paluska_2009?tab=allReceived&type=all&page=1'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
# Look for the table cells that hold the comments in the browser view
searched = soup.find_all('td', class_='col1')
print(searched)

Upvotes: 3

Views: 222

Answers (3)

Paras Mishra

Reputation: 336

To address your curiosity about QHarr's answer: if you load the URL in Chrome and watch the network calls in the developer tools, you will see a POST request to https://aukro.cz/backend/api/users/profile?username=paluska_2009. Its response is JSON that contains the information you want.

This is a straightforward way of scraping data. On most sites you will find that part of the page is loaded through separate API calls. Chrome's Network tools are handy for finding the URL and the POST parameters of such a request.
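Once you have found such an endpoint, it helps to see what the JSON actually contains before picking fields out of it. Here is a minimal sketch of that: `iter_key_paths` and `fetch_profile` are names I made up, the endpoint is the one mentioned above (it may have changed since), and the `sample` dict is invented purely to show the walker, it is not the real aukro.cz response.

```python
from typing import Any, Iterator, Tuple


def iter_key_paths(node: Any, prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (path, type name) for every leaf in nested JSON data."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_key_paths(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        # Only the first element is inspected; list items usually share a shape.
        if node:
            yield from iter_key_paths(node[0], f"{prefix}[]")
        else:
            yield f"{prefix}[]", "empty list"
    else:
        yield prefix, type(node).__name__


def fetch_profile(username: str) -> Any:
    """POST to the endpoint named in this answer and return the decoded JSON."""
    import requests  # imported lazily; only needed for the live call

    response = requests.post(
        "https://aukro.cz/backend/api/users/profile",
        params={"username": username},
        headers={"Content-Type": "application/json"},
        data="",  # assumption taken from the thread: an empty body is accepted
        timeout=10,
    )
    response.raise_for_status()
    return response.json()


# Made-up sample, NOT the real response; it only demonstrates the walker.
sample = {"user": {"name": "x", "ratings": [{"score": 1, "comment": "ok"}]}}
paths = list(iter_key_paths(sample))
for path, type_name in paths:
    print(path, type_name)
```

For the live site you would call `iter_key_paths(fetch_profile("paluska_2009"))` and look through the printed paths for the fields that hold the comments.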

Let me know if you need any further details.

Upvotes: 0

QHarr

Reputation: 84465

It is worth knowing that you can get the individual's scoring info as JSON with a POST request. Handle the JSON as you require.

import requests
import pandas as pd

headers = {
    'Content-Type': 'application/json',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
}

url = 'https://aukro.cz/backend/api/users/profile?username=paluska_2009'
# POST with an empty body; the response is JSON
response = requests.post(url, headers=headers, data='')
response.raise_for_status()
# pd.json_normalize (pandas >= 1.0) replaces the deprecated pandas.io.json.json_normalize
# and already returns a DataFrame
df = pd.json_normalize(response.json())
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8', index=False)


Upvotes: 4

Peter Bejan

Reputation: 425

I ran your code and analyzed the content you get in page.

It seems aukro.cz is built with Angular, since it uses ng-app, so the content is rendered dynamically by JavaScript, which requests does not execute. You could try Selenium in headless mode to scrape the part of the content you are looking for.
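Roughly, the headless approach could look like the sketch below. It assumes Selenium 4 with a recent Chrome installed (older Chrome versions need `--headless` instead of `--headless=new`), and it assumes the rendered page really contains the `td.col1` cells from your question; I have not verified that. `clean_texts` and `fetch_comment_cells` are just names I picked.

```python
from typing import List


def clean_texts(raw_texts: List[str]) -> List[str]:
    """Collapse runs of whitespace and drop strings that end up empty."""
    cleaned = (" ".join(text.split()) for text in raw_texts)
    return [text for text in cleaned if text]


def fetch_comment_cells(url: str, timeout: float = 15.0) -> List[str]:
    """Render the page in headless Chrome and return the text of td.col1 cells."""
    # Imported lazily so clean_texts works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until JavaScript has rendered at least one matching cell;
        # raises TimeoutException if none appears in time.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "td.col1"))
        )
        cells = driver.find_elements(By.CSS_SELECTOR, "td.col1")
        return clean_texts([cell.text for cell in cells])
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()


# Example call (needs Chrome, Selenium 4 and network access):
# texts = fetch_comment_cells(
#     "https://aukro.cz/uzivatel/paluska_2009?tab=allReceived&type=all&page=1"
# )
```

The explicit wait is the important part: reading the page source straight after `driver.get` can still return the page before Angular has filled it in.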

Let me know if you need instructions for it.

Upvotes: 1
