Reputation: 53
<div class="secondary">
<dl>
<div><dt>Joined</dt><dd><span class="relative-date date" title="Nov 2, 2019 9:24 pm" data-time="1572701042645" data-format="medium">Nov 2, '19</span></dd></div>
<div><dt>Last Post</dt><dd><span class="relative-date date" title="Nov 1, 2020 4:21 pm" data-time="1604218868661" data-format="medium">18 hours</span></dd></div>
<div><dt>Seen</dt><dd><span class="relative-date date" title="Nov 2, 2020 10:38 am" data-time="1604284735243" data-format="medium">12 mins</span></dd></div>
<div><dt>Views</dt><dd>546</dd></div>
<!----> <div><dt class="trust-level">Trust Level</dt><dd class="trust-level">Member</dd></div>
<!----> <div><dt class="groups">Groups</dt>
<dd class="groups">
<span><a href="/g/Programmers" id="ember47" class="group-link ember-view">Programmers</a></span>
<span><a href="/g/Web_Developer" id="ember49" class="group-link ember-view">Web_Developer</a></span>
<a href="/g?username=OctaLua" id="ember50" class="ember-view"> ...
</a> </dd>
</div>
<!----> </dl>
<span id="ember51" class="ember-view"> <div id="ember53" class="user-profile-secondary-outlet follow-statistics-user ember-view"><!----></div>
</span>
</div>
I am trying to get the div with the "secondary" class using the Python BeautifulSoup4 library:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://devforum.roblox.com/u/octalua').content
soup = BeautifulSoup(page, 'html.parser')
content = soup.find('div', {'class': 'secondary'})
print(content)
But whenever I print content, it prints None, even though the class is present when I inspect the page. The URL is in the Python code above if you want to check it. Thanks.
Upvotes: 1
Views: 73
Reputation: 5531
That part of the webpage is loaded dynamically by JavaScript, so it is not in the HTML that requests receives. You can use Selenium to render the page first and then scrape it:
from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('https://devforum.roblox.com/u/octalua')
time.sleep(3)  # give the page's scripts time to render the profile
# Parse the rendered DOM rather than the raw HTTP response
soup = BeautifulSoup(driver.page_source, 'html.parser')
content = soup.find('div', {'class': 'secondary'})
print(content)
driver.quit()  # end the browser session
Output:
<div class="secondary">
<dl>
<div><dt>Joined</dt><dd><span class="relative-date date" data-format="medium" data-time="1572701042645" title="Nov 2, 2019 6:54 pm">Nov 2, '19</span></dd></div>
<div><dt>Last Post</dt><dd><span class="relative-date date" data-format="medium" data-time="1604218868661" title="Nov 1, 2020 1:51 pm">19 hours</span></dd></div>
<div><dt>Seen</dt><dd><span class="relative-date date" data-format="medium" data-time="1604284735243" title="Nov 2, 2020 8:08 am">19 mins</span></dd></div>
<div><dt>Views</dt><dd>550</dd></div>
<!-- --> <div><dt class="trust-level">Trust Level</dt><dd class="trust-level">Member</dd></div>
<!-- --> <div><dt class="groups">Groups</dt>
<dd class="groups">
<span><a class="group-link ember-view" href="/g/Programmers" id="ember47">Programmers</a></span>
<span><a class="group-link ember-view" href="/g/Web_Developer" id="ember49">Web_Developer</a></span>
<a class="ember-view" href="/g?username=OctaLua" id="ember50"> ...
</a> </dd>
</div>
<!-- --> </dl>
<span class="ember-view" id="ember51"> <div class="user-profile-secondary-outlet follow-statistics-user ember-view" id="ember53"><!-- --></div>
</span>
</div>
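To see why the requests version prints None, it can help to compare what BeautifulSoup finds in the raw response with what it finds once the scripts have run. The sketch below uses two made-up HTML strings in place of the real response body and driver.page_source; static_html, rendered_html and find_secondary are hypothetical names for illustration, not the forum's actual markup:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for what requests receives: an app shell with no profile data.
static_html = '<html><body><div id="main"></div><script src="app.js"></script></body></html>'

# Hypothetical stand-in for driver.page_source after the scripts have filled the page.
rendered_html = (
    '<html><body><div id="main">'
    '<div class="secondary"><dl><div><dt>Views</dt><dd>546</dd></div></dl></div>'
    '</div></body></html>'
)

def find_secondary(html):
    """Return the first <div class="secondary"> in html, or None if it is absent."""
    return BeautifulSoup(html, 'html.parser').find('div', {'class': 'secondary'})

print(find_secondary(static_html))                    # None: the div is not in the markup
print(find_secondary(rendered_html).find('dd').text)  # 546
```

The find call is identical in both cases; only the input differs, which is why changing the selector in the original code would not help.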
Edit:
You can also get the user's data from the site's JSON endpoint, which does not need a browser. Here is code that tabulates the user's topics:
import requests
import pandas as pd

# Fetch the user's summary as JSON (no browser needed)
data = requests.get('https://devforum.roblox.com/u/octalua/summary.json').json()
topics = data['topics']

needed_keys = ["id", "posts_count", "reply_count", "last_posted_at"]
final = {}
for topic in topics:
    for key in needed_keys:
        # Use NaN when a topic lacks one of the needed keys
        final.setdefault(key, []).append(topic.get(key, float("nan")))

# Index the table by topic id
df = pd.DataFrame(final, index=final['id'])
df = df.drop('id', axis=1)
print(df)
Output:
        posts_count  reply_count            last_posted_at
777375            5            1  2020-09-19T10:09:30.064Z
571759            9            6  2020-05-14T12:15:38.374Z
626599            9            4  2020-06-15T17:24:31.469Z
610010            4            0  2020-06-04T07:24:15.153Z
593138            2            1  2020-06-01T12:01:21.984Z
548304            4            0  2020-04-29T14:11:44.803Z
830091            2            0  2020-10-21T04:27:50.161Z
606410           25           23  2020-08-14T22:22:59.322Z
612874            7            4  2020-08-29T05:48:49.863Z
841094           11            5  2020-10-28T12:55:10.337Z
841110            7            4  2020-10-29T17:25:40.995Z
419774         4813         1983  2020-11-02T04:31:40.577Z
607078           10            6  2020-06-03T14:35:40.271Z
831553           11            6  2020-10-22T16:07:17.877Z
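The last_posted_at values are ISO 8601 strings in UTC (the trailing Z). If you need to sort or compare them, one option is to parse them into datetime objects first. A minimal sketch, using three rows copied from the output above; parse_discourse_time is a hypothetical helper name, and the format string assumes every value has fractional seconds as shown:

```python
from datetime import datetime, timezone

def parse_discourse_time(value):
    """Parse a timestamp such as '2020-09-19T10:09:30.064Z' into an aware UTC datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)

# Topic id -> last_posted_at, copied from the table above
rows = {
    777375: "2020-09-19T10:09:30.064Z",
    571759: "2020-05-14T12:15:38.374Z",
    419774: "2020-11-02T04:31:40.577Z",
}

parsed = {topic_id: parse_discourse_time(ts) for topic_id, ts in rows.items()}
most_recent = max(parsed, key=parsed.get)  # topic with the latest post
print(most_recent)  # 419774
```

If the data is already in the DataFrame, pd.to_datetime(df['last_posted_at']) should give an equivalent datetime column.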
Upvotes: 1
Reputation: 9969
Since the element is loaded dynamically, wait for it to appear before reading it. This should work:
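driver = webdriver.Chrome()
driver.get('https://devforum.roblox.com/u/octalua')
# Wait up to 10 seconds for the element to be added to the DOM
elem = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "secondary")))
print(elem.text)  # use elem.get_attribute("outerHTML") for the markup instead
driver.quit()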
Imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
Output:
Joined
Nov 2, '19
Last Post
19 hours
Seen
26 mins
Views
556
Trust Level
Member
Groups
Programmers Web_Developer ...
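Because elem.text comes back as alternating label and value lines, it can be turned into a dictionary by pairing them up. This is only a sketch: it assumes every label is followed by exactly one value line, as in the output above, and pairs_to_dict is a hypothetical helper name:

```python
# The text printed above, standing in for elem.text
text = """Joined
Nov 2, '19
Last Post
19 hours
Seen
26 mins
Views
556
Trust Level
Member
Groups
Programmers Web_Developer ..."""

def pairs_to_dict(block):
    """Pair each label line with the value line that follows it."""
    lines = [line.strip() for line in block.splitlines() if line.strip()]
    return dict(zip(lines[0::2], lines[1::2]))

profile = pairs_to_dict(text)
print(profile["Views"])  # 556
```

The values stay as strings, so convert them (for example int(profile["Views"])) if you need numbers.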
Upvotes: 0