Reputation: 179
I am trying to scrape this website, follow each article's href, and scrape the comments located right after the main body text. However, I am getting blank results. I've also tried fetching all li elements with soup.find_all('li') to check whether any comments exist, and found that even the full list of li tags did not contain any comments about the article. Can anyone advise, please? I suspect the website is making it harder to get the text.
import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd

urls = [
    'https://hypebeast.com/brands/jordan-brand'
]

with requests.Session() as s:
    for url in urls:
        driver = webdriver.Chrome('/Users/Documents/python/Selenium/bin/chromedriver')
        driver.get(url)
        products = [element for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='post-box ']")))]
        soup = bs(driver.page_source, 'lxml')
        element = soup.select('.post-box ')
        time.sleep(1)
        ahref = [item.find('a')['href'] for item in element]
        results = list(zip(ahref))
        df = pd.DataFrame(results)
        for result in results:
            res = driver.get(result[0])
            soup = bs(driver.page_source, 'lxml')
            time.sleep(6)
            comments_href = soup.find_all('ul', {'id': 'post-list'})
            print(comments_href)
Upvotes: 0
Views: 180
Reputation: 28640
The post/comments are in an <iframe>
tag. The tag also has a dynamic name attribute that starts with dsq-app
. So what you'll need to do is locate that iframe, switch to it, and then you can parse. I chose to use BeautifulSoup to pull out the script
tag, read it in as JSON, and navigate through there. This should hopefully get you going with pulling what you're looking for:
import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd
import json

urls = [
    'https://hypebeast.com/brands/jordan-brand'
]

with requests.Session() as s:
    for url in urls:
        driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
        driver.get(url)
        products = [element for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='post-box ']")))]
        soup = bs(driver.page_source, 'lxml')
        element = soup.select('.post-box ')
        time.sleep(1)
        ahref = [item.find('a')['href'] for item in element]
        results = list(zip(ahref))
        df = pd.DataFrame(results)

        for result in ahref:
            driver.get(result)
            time.sleep(6)

            # The comments live in an iframe whose name starts with "dsq-app";
            # locate it and switch the driver's context into it before parsing
            iframe = driver.find_element_by_xpath('//iframe[starts-with(@name, "dsq-app")]')
            driver.switch_to.frame(iframe)

            # Inside the iframe, the thread data is embedded as JSON in a
            # script tag; find it and walk the posts list
            soup = bs(driver.page_source, 'html.parser')
            scripts = soup.find_all('script')
            for script in scripts:
                if 'response' in script.text:
                    jsonStr = script.text
                    jsonData = json.loads(jsonStr)
                    for each in jsonData['response']['posts']:
                        author = each['author']['username']
                        message = each['raw_message']
                        print('%s: %s' % (author, message))
Output:
annvee: Lemme get them BDSM jordans fam
deathb4designer: Lmao
zenmasterchen: not sure why this model needed to exist in the first place
Spawnn: Issa flop.
disqus_lEPADa2ZPn: looks like an AF1
Lekkerdan: Hoodrat shoes.
rubnalntapia: Damn this are sweet
marcellusbarnes: Dope, and I hate Jordan lows
marcellusbarnes: The little jumpman on the back is dumb
chickenboihotsauce: copping those CPFM gonna be aids
lowercasegod: L's inbound
monalisadiamante: Sold out in 4 minutes. 😑
nickpurita: Those CPFM’s r overhyped AF.
...
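As a side note, the JSON-navigation step can be exercised on its own, without a browser. Here is a minimal sketch that builds a made-up payload with the same shape as the embedded Disqus JSON (the usernames and messages below are placeholders, not real data) and walks it the same way the loop above does:

```python
import json

# Hypothetical payload mimicking the structure of the JSON found in the
# iframe's script tag: response -> posts -> author/username, raw_message.
# The values here are made up for illustration only.
jsonStr = json.dumps({
    "response": {
        "posts": [
            {"author": {"username": "annvee"},
             "raw_message": "Lemme get them BDSM jordans fam"},
            {"author": {"username": "deathb4designer"},
             "raw_message": "Lmao"},
        ]
    }
})

# Same navigation as in the scraper: parse the string, then walk the posts
jsonData = json.loads(jsonStr)
comments = []
for each in jsonData['response']['posts']:
    author = each['author']['username']
    message = each['raw_message']
    comments.append((author, message))
    print('%s: %s' % (author, message))
```

If the real script tag ever contains extra JavaScript around the JSON, json.loads will raise a ValueError, which is a quick way to confirm whether the 'response' script you grabbed is pure JSON.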
Upvotes: 1