XRaycat

Reputation: 1140

Scraping a dynamic website with Selenium/BeautifulSoup

I'm trying to scrape comments from a website using Selenium and BeautifulSoup. The site I'm trying to scrape is generated dynamically by JavaScript, and that is a little beyond what I've learned in the tutorials I've seen (I'm not very familiar with JavaScript). My best working solution so far is:

import time
from bs4 import BeautifulSoup
from selenium import webdriver

chromedriver_path = 'C:/chromedriver.exe'
browser = webdriver.Chrome(executable_path=chromedriver_path)
browser.get('https://nationen.ebcomments.dk/embed/stream?asset_id=7627366')

def load_data():
    time.sleep(1)  # The site needs to load
    # Click on the "load more comments" button
    browser.execute_script("document.querySelector('#stream > div.talk-stream-tab-container.Stream__tabContainer___2trkn > div:nth-child(2) > div > div > div > div > div:nth-child(3) > button').click()")

load_data()  # I should call this a few times to load all comments, but in this example I only do it once.
time.sleep(1)  # Give the newly requested comments time to render
# Parse the page only after the extra comments have been loaded
soup = BeautifulSoup(browser.page_source, 'html.parser')
for text in soup.find_all(class_="talk-plugin-rich-text-text"):
    print(text.get_text(), "\n")  # Print the comments

It works, but it's very slow, and I'm sure there is a better solution, especially if I want to scrape several hundred articles with comments.

I think all the comments come in JSON format (I looked at the Network tab in Chrome's dev tools, and I can see a response containing the JSON with the comments; see the pic). I then tried to use Selenium Requests (seleniumrequests) to get the data, but I'm not at all sure what I'm doing, and it's not working. It says "b'POST body missing. Did you forget to use body-parser middleware?'". Maybe I could get the JSON from the comments API, but I'm not sure if that's possible?

[pic: screenshot of Chrome's Network tab showing the JSON response with the comments]

from seleniumrequests import Chrome
chromedriver_path = 'C:/chromedriver.exe'
webdriver = Chrome(executable_path=chromedriver_path)
response = webdriver.request('POST', 'https://nationen.ebcomments.dk/api/v1/graph/ql/', data={"assetId": "7627366", "assetUrl": "", "commentId": "","excludeIgnored": "false","hasComment": "false", "sortBy": "CREATED_AT", "sortOrder": "DESC"})

Upvotes: 2

Views: 257

Answers (1)

SIM

Reputation: 22440

If the comments are all you are after, the following implementation should get you there:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://nationen.ebcomments.dk/embed/stream?asset_id=7627366"

with webdriver.Chrome() as driver:
    wait = WebDriverWait(driver, 10)
    driver.get(link)
    # Keep clicking "load more" until the wait times out or the click fails
    while True:
        try:
            wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".talk-load-more > button"))).click()
        except Exception:
            break

    # Print the text of every comment that has been loaded
    for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[data-slot-name='commentContent'] > .CommentContent__content___ZGv1q"))):
        print(item.text)
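
On the "POST body missing" error from the question: as far as I know, that message usually means the GraphQL server did not receive a JSON body, and passing a dict through data= sends it form-encoded instead. Below is a minimal sketch of building the POST with a JSON body using only the standard library. The helper names (build_graphql_request, fetch_json) are hypothetical, the QUERY string is a placeholder that has to be copied from the request payload shown in the Network tab, and treating the dict from the question as the GraphQL variables is an assumption. The endpoint may also expect cookies or extra headers, which this sketch does not cover.

```python
import json
import urllib.request

# Endpoint taken from the question.
API_URL = "https://nationen.ebcomments.dk/api/v1/graph/ql/"


def build_graphql_request(url, query, variables):
    """Build a POST request whose body is JSON rather than form-encoded data."""
    payload = {"query": query, "variables": variables}
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_json(request, timeout=10):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder: copy the real query text from the request payload in dev tools.
QUERY = "<GraphQL query copied from the Network tab>"

# Values copied from the question; assumed here to be the GraphQL variables.
VARIABLES = {
    "assetId": "7627366",
    "assetUrl": "",
    "commentId": "",
    "excludeIgnored": "false",
    "hasComment": "false",
    "sortBy": "CREATED_AT",
    "sortOrder": "DESC",
}

request = build_graphql_request(API_URL, QUERY, VARIABLES)
# data = fetch_json(request)  # run this once QUERY holds the real query
```

If this works for one article, looping over asset IDs should be much faster than driving a browser, since no page rendering is involved.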

Upvotes: 2
