Reputation: 461
I am trying to scrape YouTube comments using Selenium with Python. Below is the code, which scrapes just one comment and throws an error:
driver = webdriver.Chrome()
url="https://www.youtube.com/watch?v=MNltVQqJhRE"
driver.get(url)
wait(driver, 5500)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight + 500);")
driver.implicitly_wait(5000)
#content = driver.find_element_by_xpath('//*[@id="contents"]')
comm=driver.find_element_by_xpath('//div[@class="style-scope ytd-item-section-renderer"]')
comm1=comm.find_elements_by_xpath('//yt-formatted-string[@id="content-text"]')
#print(comm.text)
for i in range(50):
    print(comm1[i].text, end=' ')
This is the output I am getting. How do I get all the comments on that page? Can anyone help me with this?
Being a sucessful phyton freelancer really mean to me because if I able to make $2000 in month I can really help my family financial, improve my skill, and have a lot of time to refreshing. So thanks Qazi, you really help me :D
Traceback (most recent call last):
File "C:\Python36\programs\Web scrap\YT_Comm.py", line 19, in <module>
print(comm1[i].text,end=' ')
IndexError: list index out of range
Upvotes: 0
Views: 1455
Reputation: 5139
An IndexError means you're attempting to access a position in a list that doesn't exist. You're iterating over your list of elements (comm1) exactly 50 times, but there are fewer than 50 elements in the list, so eventually you attempt to access an index that doesn't exist.
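A minimal standalone sketch of that failure mode, with a hypothetical two-item list standing in for comm1:

```python
comments = ["first comment", "second comment"]  # only 2 items, like a short comm1

# A fixed range(50) overruns the list as soon as i reaches len(comments):
printed = []
try:
    for i in range(50):
        printed.append(comments[i])
except IndexError:
    print("IndexError at i =", len(printed))  # → IndexError at i = 2
```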
Superficially, you can solve your problem by changing your iteration to loop over exactly as many elements as exist in your list—no more and no less:
for element in comm1:
    print(element.text, end=' ')
But that leaves you with the problem of why your list has fewer than 50 elements. The video you’re scraping has over 90 comments. Why doesn’t your list have all of them?
If you take a look at the page in your browser, you'll see that the comments load progressively using the infinite scroll technique: when the user scrolls to the bottom of the document, another "page" of comments is fetched and rendered, increasing the length of the document. To load more comments, you will need to trigger this behavior.
But depending on the number of comments, one fetch may not be enough. In order to trigger the fetch and rendering of all of the content, then, you will need to:

1. Scroll to the bottom of the content container.
2. Wait for additional content to load.
3. Repeat until no additional content appears.
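That loop can be sketched abstractly, independent of Selenium; fetch_more() and fake_fetch() below are hypothetical stand-ins for the scroll-and-wait step:

```python
def scrape_all(fetch_more, get_count):
    """Repeat the fetch step until the item count stops growing."""
    count = get_count()
    while True:
        fetch_more()
        new_count = get_count()
        if new_count == count:  # nothing new appeared: we're done
            break
        count = new_count
    return count

# Simulate a page holding 90 comments that loads 20 per "scroll":
items = []
def fake_fetch():
    items.extend(["comment"] * min(20, 90 - len(items)))

total = scrape_all(fake_fetch, lambda: len(items))
print(total)  # → 90
```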
We already know that additional content is fetched by scrolling to the bottom of the content container (the element with id #contents), so let's do that:
driver.execute_script(
    "window.scrollTo(0, document.querySelector('#contents').scrollHeight);")
(Note: Because the content resides in an absolute-positioned element, document.body.scrollHeight will always be 0 and will not trigger a scroll.)
But as with any browser automation, we're in a race with the application: What if the content container hasn't rendered yet? Our scroll would fail.
Selenium provides WebDriverWait() to help you wait for the application to be in a particular state. It also provides, via its expected_conditions module, a set of common states to wait for, such as the presence of an element. We can use both of these to wait for the content container to be present:
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
TIMEOUT_IN_SECONDS = 10
wait = WebDriverWait(driver, TIMEOUT_IN_SECONDS)
wait.until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#contents")))
At a high level, we can determine whether additional content was fetched by counting the content before and after the scroll.

Within our container (with id #contents), each piece of content has id #content. To count the content, we can simply fetch each of those elements and use Python's built-in len():
count = len(driver.find_elements_by_css_selector("#contents #content"))
But again, we're in a race with the application: What happens if either the fetch or the render of additional content is slow? We won't immediately see it.
We need to give the web application time to do its thing. To do this, we can use WebDriverWait()
with a custom condition:
def get_count():
    return len(driver.find_elements_by_css_selector("#contents #content"))

count = get_count()

# ...

wait.until(
    lambda _: get_count() > count)
But what if there isn't any additional content? Our wait for the count to increase will time out.
As long as our timeout is high enough to allow sufficient time for the additional content to appear, we can assume that there is no additional content and ignore the timeout:
try:
    wait.until(
        lambda _: get_count() > count)
except TimeoutException:
    # No additional content appeared. Abort our loop.
    break
Putting it all together:

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

TIMEOUT_IN_SECONDS = 10

wait = WebDriverWait(driver, TIMEOUT_IN_SECONDS)
driver.get(URL)

wait.until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#contents")))

def get_count():
    return len(driver.find_elements_by_css_selector("#contents #content"))

while True:
    count = get_count()
    driver.execute_script(
        "window.scrollTo(0, document.querySelector('#contents').scrollHeight);")
    try:
        wait.until(
            lambda _: get_count() > count)
    except TimeoutException:
        # No additional content appeared. Abort our loop.
        break

elements = driver.find_elements_by_css_selector("#contents #content")
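Once the loop exits, elements holds every comment node, and extracting the text mirrors the loop from the question. A sketch using a stub class standing in for Selenium's WebElement (only the .text attribute is used here):

```python
# Stub exposing the same .text attribute as a Selenium WebElement,
# so the extraction pattern can be shown without a live browser:
class FakeElement:
    def __init__(self, text):
        self.text = text

elements = [FakeElement("first!"), FakeElement("great video")]

# Pull the text out of each comment node:
comments = [element.text for element in elements]
print(comments)  # → ['first!', 'great video']
```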
With capybara-py, this becomes a bit simpler:
import capybara
from capybara.dsl import page
from capybara.exceptions import ExpectationNotMet

@capybara.register_driver("selenium_chrome")
def init_selenium_chrome_driver(app):
    from capybara.selenium.driver import Driver
    return Driver(app, browser="chrome")

capybara.current_driver = "selenium_chrome"
capybara.default_max_wait_time = 10

page.visit(URL)

contents = page.find("#contents")

elements = []
while True:
    try:
        elements = contents.find_all("#content", minimum=len(elements) + 1)
    except ExpectationNotMet:
        # No additional content appeared. Abort our loop.
        break

    page.execute_script(
        "window.scrollTo(0, arguments[0].scrollHeight);", contents)
Upvotes: 5