Reputation: 179
I am trying to scrape this website, follow each article's href, and scrape the comments located right after the main body text. However, I am getting blank results. I've also tried fetching all li elements with soup.find_all('li') to check whether any comments exist, and found that even the full list of li tags did not contain any comments about the article. Can anyone advise, please? I suspect the website is making it harder to get the text.
import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd

urls = [
    'https://hypebeast.com/brands/jordan-brand'
]

with requests.Session() as s:
    for url in urls:
        driver = webdriver.Chrome('/Users/Documents/python/Selenium/bin/chromedriver')
        driver.get(url)
        products = [element for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='post-box ']")))]
        soup = bs(driver.page_source, 'lxml')
        element = soup.select('.post-box ')
        time.sleep(1)
        ahref = [item.find('a')['href'] for item in element]
        results = list(zip(ahref))
        df = pd.DataFrame(results)
        for result in results:
            res = driver.get(result[0])
            soup = bs(driver.page_source, 'lxml')
            time.sleep(6)
            comments_href = soup.find_all('ul', {'id': 'post-list'})
            print(comments_href)
Upvotes: 0
Views: 180
Reputation: 28640
The post/comments are in an <iframe>
tag. The tag also has a dynamic name attribute that starts with dsq-app
. So what you'll need to do is locate that iframe, switch to it, and then you can parse. I chose to use BeautifulSoup to pull out the script
tag, read it in as JSON, and navigate through there. This should hopefully get you going with pulling what you're looking for:
import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd
import json

urls = [
    'https://hypebeast.com/brands/jordan-brand'
]

with requests.Session() as s:
    for url in urls:
        driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
        driver.get(url)
        products = [element for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='post-box ']")))]
        soup = bs(driver.page_source, 'lxml')
        element = soup.select('.post-box ')
        time.sleep(1)
        ahref = [item.find('a')['href'] for item in element]
        results = list(zip(ahref))
        df = pd.DataFrame(results)

        for result in ahref:
            driver.get(result)
            time.sleep(6)

            # The comments live in an iframe whose name starts with "dsq-app";
            # locate it and switch the driver's context into it before parsing
            iframe = driver.find_element_by_xpath('//iframe[starts-with(@name, "dsq-app")]')
            driver.switch_to.frame(iframe)

            # Inside the iframe, the thread data is embedded as JSON in a
            # script tag; find it and walk the posts list
            soup = bs(driver.page_source, 'html.parser')
            scripts = soup.find_all('script')
            for script in scripts:
                if 'response' in script.text:
                    jsonStr = script.text
                    jsonData = json.loads(jsonStr)
                    for each in jsonData['response']['posts']:
                        author = each['author']['username']
                        message = each['raw_message']
                        print('%s: %s' % (author, message))
Output:
annvee: Lemme get them BDSM jordans fam
deathb4designer: Lmao
zenmasterchen: not sure why this model needed to exist in the first place
Spawnn: Issa flop.
disqus_lEPADa2ZPn: looks like an AF1
Lekkerdan: Hoodrat shoes.
rubnalntapia: Damn this are sweet
marcellusbarnes: Dope, and I hate Jordan lows
marcellusbarnes: The little jumpman on the back is dumb
chickenboihotsauce: copping those CPFM gonna be aids
lowercasegod: L's inbound
monalisadiamante: Sold out in 4 minutes. 😑
nickpurita: Those CPFM’s r overhyped AF.
...
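As a side note, the JSON-navigation step can be exercised on its own, without a browser. Here is a minimal sketch that builds a made-up payload with the same shape as the embedded Disqus JSON (the usernames and messages below are placeholders, not real data) and walks it the same way the loop above does:

```python
import json

# Hypothetical payload mimicking the structure of the JSON found in the
# iframe's script tag: response -> posts -> author/username, raw_message.
# The values here are made up for illustration only.
jsonStr = json.dumps({
    "response": {
        "posts": [
            {"author": {"username": "annvee"},
             "raw_message": "Lemme get them BDSM jordans fam"},
            {"author": {"username": "deathb4designer"},
             "raw_message": "Lmao"},
        ]
    }
})

# Same navigation as in the scraper: parse the string, then walk the posts
jsonData = json.loads(jsonStr)
comments = []
for each in jsonData['response']['posts']:
    author = each['author']['username']
    message = each['raw_message']
    comments.append((author, message))
    print('%s: %s' % (author, message))
```

If the real script tag ever contains extra JavaScript around the JSON, json.loads will raise a ValueError, which is a quick way to confirm whether the 'response' script you grabbed is pure JSON.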
Upvotes: 1