miatochai

Reputation: 343

Trying to scrape a page with Selenium and ChromeDriver. It loads the page but then times out

I'm trying to scrape all that's inside the html tag.

Basically, it reaches the GoToUrl line and opens the page in the browser, but then it doesn't go any further in the code.

It just times out after 60 seconds.

Here's the error:

fail: Microsoft.AspNetCore.Diagnostics.DeveloperExceptionPageMiddleware[1]
      An unhandled exception has occurred while executing the request.

Update: edited for privacy reasons.

Upvotes: 0

Views: 299

Answers (1)

ggeorge

Reputation: 1630

I made an example for your scenario.

Let's say we want to scrape the posts on the home page, so we need a model to store our data:

using System.Text.Json;

public class Post
{
    public string ImageSrc { get; set; }
    public string Category { get; set; }
    public string Title { get; set; }
    public string Description { get; set; }
    public string Date { get; set; }

    public override string ToString()
    {
        return JsonSerializer.Serialize(this, 
              new JsonSerializerOptions { WriteIndented = true });
    }
}

Next, we need to initialize the Selenium WebDriver:

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

var options = new ChromeOptions();
options.AddArgument("--no-sandbox");
using var driver = new ChromeDriver(options);

// Set up a fluent wait: poll every 250 ms for up to 20 seconds
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(20))
{
    PollingInterval = TimeSpan.FromMilliseconds(250)
};
wait.IgnoreExceptionTypes(typeof(NoSuchElementException), typeof(StaleElementReferenceException));

// Navigate to the target url
driver.Navigate().GoToUrl("https://www.rtlnieuws.nl/zoeken?q=Philips+fraude");

// Accept cookies
var cookieBtn = wait.Until(d => d.FindElement(By.Id("onetrust-accept-btn-handler")));
cookieBtn.Click();

// Scroll to end
int count = 0; 
await driver.ScrollToEndAsync(d =>
{
    // We are at the end of the page once scrolling no longer loads more items
    var tempCount = d.FindElements(By.XPath("//a[@class = 'search-item search-item--artikel']")).Count;
    if (tempCount != count)
    {
        count = tempCount;
        return false;
    }

    return true;
});

// List of post elements
var elements = wait.Until(d =>
{
    return d.FindElements(By.XPath("//div[@class = 'search-items']//a[contains(@class, 'search-item')]"));
});

// Print the posts in JSON format
foreach (var e in elements)
{
    var post = new Post
    {
        ImageSrc = e.FindElement(By.XPath(".//img")).GetAttribute("src"),
        Category = e.FindElement(By.XPath(".//div/span")).Text,
        Title = e.FindElement(By.XPath(".//div/h2")).Text,
        Description = e.FindElement(By.XPath(".//div[@class = 'search-item__content']/p[@class = 'search-item__description']")).Text,
        Date = e.FindElement(By.XPath(".//div[@class = 'search-item__content']//span[@class = 'search-item__date']")).Text,
    };
    Console.WriteLine(post);
}

// For this sample only: keep the console open so we can see the results
Console.ReadLine();

To use ScrollToEndAsync as shown above, you need to create an extension method:

public static class WebDriverExtensions
{
    public static async Task ScrollToEndAsync(this IWebDriver driver, Func<IWebDriver, bool> pageEnd)
    {
        while (!pageEnd.Invoke(driver))
        {
            var js = (IJavaScriptExecutor)driver;
            js.ExecuteScript("window.scrollTo(0, document.body.scrollHeight);");
            
            // Arbitrary delay between scrolling
            await Task.Delay(200);
        }
    }
}
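
One caveat with the end-of-page check above: it stops as soon as the item count is unchanged between two polls, so with a 200 ms delay a slow response could end the scrolling early. A minimal sketch of a more tolerant check follows, assuming you are fine with a small helper class. StableCountDetector and its requiredStablePolls parameter are hypothetical names of mine, not part of Selenium. It only reports the end after the count has stayed the same for several consecutive polls.

```csharp
using System;

// Hypothetical helper (not part of Selenium): reports "stable" only after the
// observed count has been unchanged for a number of consecutive polls.
public sealed class StableCountDetector
{
    private readonly int _requiredStablePolls;
    private int _lastCount = -1;   // -1 means "nothing observed yet"
    private int _stablePolls;      // consecutive polls with an unchanged count

    public StableCountDetector(int requiredStablePolls = 3)
    {
        if (requiredStablePolls < 1)
            throw new ArgumentOutOfRangeException(nameof(requiredStablePolls));
        _requiredStablePolls = requiredStablePolls;
    }

    // Feed in the current item count; returns true once it has stopped changing.
    public bool IsStable(int currentCount)
    {
        if (currentCount == _lastCount)
        {
            _stablePolls++;
        }
        else
        {
            // The count changed, so remember it and start counting again.
            _lastCount = currentCount;
            _stablePolls = 0;
        }

        return _stablePolls >= _requiredStablePolls;
    }
}

// Possible usage with the extension method above (assumes the same XPath):
//   var detector = new StableCountDetector(3);
//   await driver.ScrollToEndAsync(d => detector.IsStable(
//       d.FindElements(By.XPath("//a[@class = 'search-item search-item--artikel']")).Count));
```

With three required polls and the 200 ms delay, the loop ends roughly 600 ms after the last new item appears. You may need to tune both numbers for your connection.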

Upvotes: 1