naveen malla

Reputation: 65

How to get all comments in 9gag using selenium?

I'm working on scraping the memes and all their comments from 9gag. I used the code below, but I am only getting a few extra comments.

actions = ActionChains(driver)
link = driver.find_element(By.XPATH, "//button[@class='comment-list__load-more']")
actions.move_to_element(link).click(on_element=link).perform()

I would also like to access the subcomments under a comment by simulating a click on "view more replies".

From the HTML, I found that the element returned by driver.find_element(By.XPATH, "//div[@class='vue-recycle-scroller ready page-mode direction-vertical']") holds the comments section, but I'm not sure how to iterate through each comment in this element and simulate these clicks.

The code should run as-is, provided the necessary libraries are installed, in case you want to test it.

Please help me with the following tasks:

  1. Getting all the comments from "view all comments"
  2. Iterating through each comment and clicking on "view more replies" to get all the subcomments

My Code

import time
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import undetected_chromedriver as uc

if __name__ == '__main__':

    options = Options()
    # options.headless = True
    options.add_argument("start-maximized")  # ensure window is full-screen
    driver = uc.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    driver.get("https://9gag.com/gag/a5EAv9O")
    prev_h = 0
    for i in range(10):
        height = driver.execute_script("""
                   function getActualHeight() {
                       return Math.max(
                           Math.max(document.body.scrollHeight, document.documentElement.scrollHeight),
                           Math.max(document.body.offsetHeight, document.documentElement.offsetHeight),
                           Math.max(document.body.clientHeight, document.documentElement.clientHeight)
                       );
                   }
                   return getActualHeight();
               """)
        # window.scrollTo takes (x, y); keep x at 0 and only scroll vertically
        driver.execute_script(f"window.scrollTo(0, {prev_h + 200})")
        time.sleep(1)
        prev_h += 200
        if prev_h >= height:
            break
    time.sleep(5)
    title = driver.title[:-7]
    try:
        upvotes_count = \
        driver.find_element(By.XPATH, "//meta[@property='og:description']").get_attribute("content").split(' ')[0]
        comments_count = \
        driver.find_element(By.XPATH, "//meta[@property='og:description']").get_attribute("content").split(' ')[3]
        upvotes_count = int(upvotes_count) if len(upvotes_count) <= 3 else int("".join(upvotes_count.split(',')))
        comments_count = int(comments_count) if len(comments_count) <= 3 else int("".join(comments_count.split(',')))
        date_posted = driver.find_element(By.XPATH, "//p[@class='message']")
        date_posted = date_posted.text.split("·")[1].strip()
        # actions = ActionChains(driver)
        # link = driver.find_element(By.XPATH, "//button[@class='comment-list__load-more']")
        # actions.move_to_element(link).click(on_element=link).perform()
        element = driver.find_element(By.XPATH,
                                      "//div[@class='vue-recycle-scroller ready page-mode direction-vertical']")
        print(element.text)
        driver.quit()
    except Exception as err:  # "A or B" would only catch A; Exception also covers NoSuchElementException
        print(err)


Edit:

I managed to make the code work better. It scrolls through the page until it sees all the comments. It also clicks on view more replies if there are subcomments.

But it's only able to read comments from the middle to the end. Maybe the initial comments are hidden dynamically as the page is scrolled down; I do not know how to overcome this. Also, clicking on "view more replies" stops after some clicks and throws this error:

selenium.common.exceptions.MoveTargetOutOfBoundsException: Message: move target out of bounds

Here's the updated code

from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import time
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, ElementClickInterceptedException
from selenium.webdriver.support.wait import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import undetected_chromedriver as uc

def scroll_page(scrl_hgt):
    prev_h = 0
    for i in range(10):
        height = driver.execute_script("""
                       function getActualHeight() {
                           return Math.max(
                               Math.max(document.body.scrollHeight, document.documentElement.scrollHeight),
                               Math.max(document.body.offsetHeight, document.documentElement.offsetHeight),
                               Math.max(document.body.clientHeight, document.documentElement.clientHeight)
                           );
                       }
                       return getActualHeight();
                   """)
        # window.scrollTo takes (x, y); keep x at 0 and only scroll vertically
        driver.execute_script(f"window.scrollTo(0, {prev_h + scrl_hgt})")
        time.sleep(1)
        prev_h += scrl_hgt
        if prev_h >= height:
            break

if __name__ == '__main__':
    options = Options()
    # options.headless = True
    driver = uc.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    driver.maximize_window()
    driver.get("https://9gag.com/gag/a5EAv9O")
    time.sleep(5)

    # click on I accept cookies
    actions = ActionChains(driver)
    consent_button = driver.find_element(By.XPATH, '//*[@id="qc-cmp2-ui"]/div[2]/div/button[2]')
    actions.move_to_element(consent_button).click().perform()

    scroll_page(150)
    time.sleep(2)

    # click on the fresh comments section
    fresh_comments = driver.find_element(By.XPATH, '//*[@id="page"]/div[1]/section[2]/section/header/div/button[2]')
    actions.move_to_element(fresh_comments).click(on_element=fresh_comments).perform()

    time.sleep(5)

    # getting meta data
    title = driver.title[:-7]
    upvotes_count = driver.find_element(By.XPATH, "//meta[@property='og:description']").get_attribute("content").split(' ')[0]
    comments_count = driver.find_element(By.XPATH, "//meta[@property='og:description']").get_attribute("content").split(' ')[3]
    upvotes_count = int(upvotes_count) if len(upvotes_count) <= 3 else int("".join(upvotes_count.split(',')))
    comments_count = int(comments_count) if len(comments_count) <= 3 else int("".join(comments_count.split(',')))
    date_posted = driver.find_element(By.XPATH, "//p[@class='message']")
    date_posted = date_posted.text.split("·")[1].strip()

    time.sleep(3)

    # click on the load more comments button to load all the comments
    load_more_comments = driver.find_element(By.XPATH, "//button[@class='comment-list__load-more']")
    actions.move_to_element(load_more_comments).click(on_element=load_more_comments).perform()

    scroll_page(500)

    print([my_elem.text for my_elem in driver.find_elements(By.CSS_SELECTOR, "div.comment-list-item__text")])

    comments = driver.find_elements(By.CSS_SELECTOR, "div.vue-recycle-scroller__item-view")
    for item in comments:
        html = item.get_attribute("innerHTML")
        if "comment-list-item__text" in html:
            print(item.find_element(By.CSS_SELECTOR, "div.comment-list-item__text").text)
        elif "comment-list-item__deleted-text" in html:
            print(item.find_element(By.CSS_SELECTOR, "div.comment-list-item__deleted-text").text)

        # get sub comments
        if "comment-list-item__replies" in html:
            #item.find_element(By.CSS_SELECTOR, "div.comment-list-item__replies").click()
            sub_comments = item.find_element(By.CSS_SELECTOR, "div.comment-list-item__replies")
            actions.move_to_element(sub_comments).click(on_element=sub_comments).perform()
        time.sleep(2)
    driver.quit()


PS: My goal is to get every single comment and all of its subcomments (whether they are text, image, GIF, etc.) in the order they appear and save them somewhere so that I can recreate the comments section.

Upvotes: 0

Views: 585

Answers (1)

undetected Selenium

Reputation: 193088

To extract and print the comment texts, you need to induce WebDriverWait for element_to_be_clickable() on the "load more" button, click it, and then collect the comment elements. You can use the following locator strategy:

driver.get("https://9gag.com/gag/a5EAv9O")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.comment-list__load-more"))).click()
print([my_elem.text for my_elem in driver.find_elements(By.CSS_SELECTOR, "div.comment-list-item__text")])

Console Output:

['Man, the battle of the cults is getting interesting now.', 'rent free in your head', 'Sorry saving all my money up for the Joe Biden Depends Multipack and the Karmella knee pads.', "It's basically a cult now.", "I'll take one. I'm not even American", '', 'that eagle looks familiar.', "Who doesn't want a trump card?"]

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
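
The snippet above only returns the comments rendered at that moment. The vue-recycle-scroller class in the question suggests a virtual scroller that reuses DOM nodes, so earlier comments may leave the DOM as the page scrolls down. If that assumption holds, one option is to read the visible comments after every small scroll step and merge them in first-seen order. The sketch below also scrolls an element to the centre of the viewport before clicking it, which may avoid MoveTargetOutOfBoundsException because no ActionChains mouse move is involved. The helper names (merge_in_order, collect_while_scrolling, click_in_view) are hypothetical; the only Selenium calls used are driver.execute_script() and WebElement.click(). Deduplicating by text is a simplification: it drops repeated identical comments and skips empty texts such as image-only comments.

```python
import time

# JavaScript snippets executed in the page; the selector is passed as arguments[0].
GRAB_TEXTS = "return Array.from(document.querySelectorAll(arguments[0])).map(e => e.innerText);"
AT_BOTTOM = "return window.innerHeight + window.scrollY >= document.documentElement.scrollHeight - 2;"


def merge_in_order(seen, batch):
    """Append texts from batch that are not yet in seen, keeping first-seen order.

    Empty strings are skipped (assumption: they carry no text to deduplicate on).
    """
    known = set(seen)
    for text in batch:
        if text and text not in known:
            seen.append(text)
            known.add(text)
    return seen


def collect_while_scrolling(driver, selector="div.comment-list-item__text",
                            step=500, pause=1.0, max_steps=200):
    """Scroll down in small steps, reading the rendered comments at each step."""
    seen = []
    for _ in range(max_steps):
        # Read all texts in one script call so recycled nodes cannot go stale mid-read.
        merge_in_order(seen, driver.execute_script(GRAB_TEXTS, selector))
        if driver.execute_script(AT_BOTTOM):
            break
        driver.execute_script("window.scrollBy(0, arguments[0]);", step)
        time.sleep(pause)  # give the page time to render the next comments
    return seen


def click_in_view(driver, element):
    """Bring the element into the viewport, then click it without a mouse move."""
    driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", element)
    element.click()
```

After clicking the "load more" button as shown above, calling collect_while_scrolling(driver) from the top of the comments section would return the texts in the order they were first seen. click_in_view(driver, element) can replace the actions.move_to_element(...).click(...) calls for the "view more replies" elements. Whether a second click on an already expanded replies block collapses it again is not known here, so each block should be clicked only once.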

Upvotes: 1
