AlphonseSun

Reputation: 1

My Azure web app cannot access some website pages

I have a web app hosted in Azure (Location: West US 2). It is a backend API app that provides APIs for my front-end web app. In one of the APIs, I need to get the page content from a given URL. The code looks like this:

using AngleSharp;
using System.Net.Http;

// Fetch the raw HTML, then parse it with AngleSharp
using HttpClient httpClient = new HttpClient();
httpClient.Timeout = TimeSpan.FromMinutes(5);
var htmlContent = await httpClient.GetStringAsync(data.Url);
var browsingContext = BrowsingContext.New(Configuration.Default);
var browsingDocument = await browsingContext.OpenAsync(req => req.Content(htmlContent));
var textContent = browsingDocument.Body.TextContent;

But when I tried to access a URL like "https://www.allstate.com/auto-insurance/car-coverage-policies", it threw a timeout error.

I tried some other URLs, like https://www.youtube.com and https://www.google.com, and they all worked. It looks like my app can't access some URLs.

So, does anyone know how to resolve this issue so that my app can access that "https://www.allstate.com/..." page? Do I need to add some web app configuration in Azure, or add an Azure virtual network? I don't know much about Azure networking. It would be appreciated if anyone could provide detailed answers. Thanks a lot!

Upvotes: 0

Views: 262

Answers (1)

SiddheshDesai

Reputation: 8187

I faced the same error as well, and the request kept loading while fetching content from https://www.allstate.com/auto-insurance/car-coverage-policies. This might happen due to:

  • Firewall or Network Restrictions: Check for any network restrictions or firewall rules blocking access to the domain.
  • Rate Limiting or Bot Detection: Verify if the website employs rate limiting or bot detection mechanisms that might be blocking automated access. Try accessing the site from a different IP address or user agent.
  • SSL Certificate: Ensure the SSL certificate for the domain is valid and trusted by your environment to prevent HTTPS request failures.
  • Client-Side Rendering: Websites relying heavily on client-side rendering might not be fully captured by HttpClient, which only sees the initial HTML. Consider using a headless browser like Puppeteer for accurate rendering.
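The bot-detection cause above is often cheap to test: send browser-like headers and see whether the response changes. Below is a minimal sketch, assuming the block is header-based (if the site filters Azure datacenter IPs, headers alone won't help); `CreateBrowserLikeClient` is a hypothetical helper name, not part of any library:

```csharp
using System;
using System.Net.Http;

class Program
{
    // Build an HttpClient whose default headers mimic a desktop browser.
    // Many bot-detection layers inspect User-Agent, Accept, and Accept-Language.
    static HttpClient CreateBrowserLikeClient()
    {
        var client = new HttpClient { Timeout = TimeSpan.FromSeconds(30) };
        client.DefaultRequestHeaders.UserAgent.ParseAdd(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
            "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36");
        client.DefaultRequestHeaders.Accept.ParseAdd("text/html,application/xhtml+xml");
        client.DefaultRequestHeaders.AcceptLanguage.ParseAdd("en-US,en;q=0.9");
        return client;
    }

    static void Main()
    {
        using var client = CreateBrowserLikeClient();
        // To actually fetch a page you would call, e.g.:
        //   var html = await client.GetStringAsync("https://www.allstate.com/...");
        // Here we only print the configured User-Agent to show the setup.
        Console.WriteLine(client.DefaultRequestHeaders.UserAgent.ToString());
    }
}
```

If the same request succeeds from your local machine but still times out from Azure, the site is likely blocking the datacenter IP range rather than the headers.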

But it worked, and content was fetched from https://en.wikipedia.org/wiki/Food.

My code, with cookie handling enabled and the proxy bypassed:

using HtmlAgilityPack;
using Microsoft.AspNetCore.Mvc;
using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;
using WebApplication3.Models;

namespace WebApplication3.Controllers
{
    public class HomeController : Controller
    {
        private readonly ILogger<HomeController> _logger;

        public HomeController(ILogger<HomeController> logger)
        {
            _logger = logger;
        }

        public IActionResult Index()
        {
            return View();
        }

        public IActionResult Privacy()
        {
            return View();
        }

        [HttpGet]
        public async Task<IActionResult> GetExternalContent()
        {
            // URL to access
            string url = "https://en.wikipedia.org/wiki/Food";

            try
            {
                // Handler that enables cookies, skips SSL certificate validation,
                // and bypasses any system proxy
                var httpClientHandler = new HttpClientHandler
                {
                    UseCookies = true,
                    // WARNING: accepting any certificate disables SSL validation; use for testing only
                    ServerCertificateCustomValidationCallback = (message, cert, chain, errors) => true,
                    Proxy = null // Bypass proxy settings
                };

                // Set up HttpClient with HttpClientHandler
                using (var httpClient = new HttpClient(httpClientHandler))
                {
                    // Set timeout duration
                    httpClient.Timeout = TimeSpan.FromMinutes(5);

                    // Send GET request to the URL
                    HttpResponseMessage response = await httpClient.GetAsync(url);

                    // Check if the response is successful
                    if (response.IsSuccessStatusCode)
                    {
                        // Read the content as string
                        string htmlContent = await response.Content.ReadAsStringAsync();

                        // Parse the HTML content using HtmlAgilityPack
                        HtmlDocument htmlDocument = new HtmlDocument();
                        htmlDocument.LoadHtml(htmlContent);

                        // Extract the <p> elements (SelectNodes returns null when there are no matches)
                        HtmlNodeCollection nodes = htmlDocument.DocumentNode.SelectNodes("//p");

                        // Construct a new string with the extracted content
                        string extractedContent = "";
                        if (nodes != null)
                        {
                            foreach (HtmlNode node in nodes)
                            {
                                extractedContent += node.InnerText + Environment.NewLine;
                            }
                        }

                        // Pass the extracted content to the view
                        ViewData["ExtractedContent"] = extractedContent;
                        return View("ExternalContent");
                    }
                    else
                    {
                        // Handle unsuccessful response (e.g., log error, return error view)
                        return View("Error", new ErrorViewModel { RequestId = Activity.Current?.Id ?? HttpContext.TraceIdentifier });
                    }
                }
            }
            catch (Exception ex)
            {
                // Log and handle any exceptions that occur during the request
                _logger.LogError(ex, "Failed to fetch external content from {Url}", url);
                return View("Error", new ErrorViewModel { RequestId = Activity.Current?.Id ?? HttpContext.TraceIdentifier });
            }
        }

    }
}

Wikipedia content fetched:

(screenshot of the fetched Wikipedia page content)

I'd recommend using a web scraping tool like Power Automate Desktop (refer to my SO answer here) or the ScrapFly tool.

Upvotes: 0
