Reputation: 1
I have a web app hosted in Azure (Location: West US2). It is a backend API app that provides APIs for my front-end web app. One of the APIs needs to fetch the page content from a given URL. The code looks like this:
using HttpClient httpClient = new HttpClient();
httpClient.Timeout = TimeSpan.FromMinutes(5);
var htmlContent = await httpClient.GetStringAsync(data.Url);
var browsingContext = BrowsingContext.New(Configuration.Default);
var browsingDocument = await browsingContext.OpenAsync(req => req.Content(htmlContent));
var textContent = browsingDocument.Body.TextContent;
But when I tried to access a URL like "https://www.allstate.com/auto-insurance/car-coverage-policies", it threw a timeout error.
I tried some other URLs, such as https://www.youtube.com and https://www.google.com, and they all worked. It looks like my app can't access some URLs.
So, does anyone know how to resolve this issue so that my app can access that "https://www.allstate.com/..." page? Do I need to add some web app configuration in Azure or add an Azure virtual network? I don't know much about Azure networking. It would be appreciated if anyone could provide detailed answers. Thanks a lot!
Upvotes: 0
Views: 262
Reputation: 8187
I faced the same error too: the request kept loading while fetching content from:-
https://www.allstate.com/auto-insurance/car-coverage-policies
this might happen due to:-
But it worked for https://en.wikipedia.org/wiki/Food, and the content was fetched.
My Code with proxy and cookies:-
using HtmlAgilityPack;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.Logging;
using System;
using System.Diagnostics;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using WebApplication3.Models;

namespace WebApplication3.Controllers
{
    public class HomeController : Controller
    {
        private readonly ILogger<HomeController> _logger;

        public HomeController(ILogger<HomeController> logger)
        {
            _logger = logger;
        }

        public IActionResult Index()
        {
            return View();
        }

        public IActionResult Privacy()
        {
            return View();
        }

        [HttpGet]
        public async Task<IActionResult> GetExternalContent()
        {
            // URL to access
            string url = "https://en.wikipedia.org/wiki/Food";
            try
            {
                // Create HttpClientHandler with cookies enabled and no proxy
                var httpClientHandler = new HttpClientHandler
                {
                    UseCookies = true,
                    // WARNING: this accepts any server certificate (validation is disabled).
                    // Use for local testing only; remove it in production.
                    ServerCertificateCustomValidationCallback = (message, cert, chain, errors) => true,
                    UseProxy = false // Bypass proxy settings
                };
                // Set up HttpClient with HttpClientHandler
                using (var httpClient = new HttpClient(httpClientHandler))
                {
                    // Set timeout duration
                    httpClient.Timeout = TimeSpan.FromMinutes(5);
                    // Send GET request to the URL
                    HttpResponseMessage response = await httpClient.GetAsync(url);
                    // Check if the response is successful
                    if (response.IsSuccessStatusCode)
                    {
                        // Read the content as string
                        string htmlContent = await response.Content.ReadAsStringAsync();
                        // Parse the HTML content using HtmlAgilityPack
                        HtmlDocument htmlDocument = new HtmlDocument();
                        htmlDocument.LoadHtml(htmlContent);
                        // Extract all <p> elements from the HTML document
                        HtmlNodeCollection nodes = htmlDocument.DocumentNode.SelectNodes("//p");
                        // Build a string with the extracted content
                        var extractedContent = new StringBuilder();
                        // SelectNodes returns null when nothing matches
                        if (nodes != null)
                        {
                            foreach (HtmlNode node in nodes)
                            {
                                extractedContent.AppendLine(node.InnerText);
                            }
                        }
                        // Pass the extracted content to the view
                        ViewData["ExtractedContent"] = extractedContent.ToString();
                        return View("ExternalContent");
                    }
                    else
                    {
                        // Log the unsuccessful response and return the error view
                        _logger.LogWarning("Request to {Url} returned status {StatusCode}", url, (int)response.StatusCode);
                        return View("Error", new ErrorViewModel { RequestId = Activity.Current?.Id ?? HttpContext.TraceIdentifier });
                    }
                }
            }
            catch (Exception ex)
            {
                // Log any exception (including timeouts) and return the error view
                _logger.LogError(ex, "Failed to fetch content from {Url}", url);
                return View("Error", new ErrorViewModel { RequestId = Activity.Current?.Id ?? HttpContext.TraceIdentifier });
            }
        }
    }
}
Wikipedia content fetched:-
I'd recommend using a web scraping tool such as Power Automate Desktop (refer to my SO answer here) or the scrapFly tool.
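Before reaching for a separate tool, it may be worth narrowing down why the request hangs. The sketch below is only a diagnostic aid, not a confirmed fix: it assumes (without having verified it for this site) that the server may treat requests without browser-like headers differently, so it sends a few such headers, uses a short timeout, and reports whether the call timed out or failed at the network level. The class name `FetchDiagnostics`, its methods, and the header values are all hypothetical and chosen for illustration.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class FetchDiagnostics
{
    // Hypothetical helper: builds a GET request with browser-like headers.
    // The header values are assumptions; copy the ones your own browser sends.
    public static HttpRequestMessage BuildRequest(string url)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        request.Headers.TryAddWithoutValidation("User-Agent",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36");
        request.Headers.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml");
        request.Headers.TryAddWithoutValidation("Accept-Language", "en-US,en;q=0.9");
        return request;
    }

    // Hypothetical helper: maps an exception to a short label for logging.
    public static string Classify(Exception ex)
    {
        switch (ex)
        {
            // TaskCanceledException derives from OperationCanceledException,
            // so this case covers both a cancelled token and an HttpClient timeout.
            case OperationCanceledException _:
                return "timeout";
            // Thrown for DNS, connection, or TLS failures.
            case HttpRequestException _:
                return "network-error";
            default:
                return "other";
        }
    }

    // Sends the request with a short timeout and returns either the
    // HTTP status code or the label produced by Classify.
    public static async Task<string> TryFetchAsync(HttpClient client, string url, TimeSpan timeout)
    {
        using (var cts = new CancellationTokenSource(timeout))
        {
            try
            {
                using (var request = BuildRequest(url))
                using (var response = await client.SendAsync(request, cts.Token))
                {
                    return "status " + (int)response.StatusCode;
                }
            }
            catch (Exception ex)
            {
                return Classify(ex);
            }
        }
    }
}
```

Calling `await FetchDiagnostics.TryFetchAsync(httpClient, url, TimeSpan.FromSeconds(30))` from the controller, both locally and from the Azure app, and comparing the results should show whether the behaviour depends on where the request originates or on how it is formed.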
Upvotes: 0