WtFudgE

Reputation: 5228

Get HTML code from a website that has a loading page in C#

I am using the code from this post: Get HTML code from website in C#

to save the HTML in a string:

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    if (response.StatusCode == HttpStatusCode.OK)
    {
        Stream receiveStream = response.GetResponseStream();

        // Use the character set declared by the response when available.
        StreamReader readStream = response.CharacterSet == null
            ? new StreamReader(receiveStream)
            : new StreamReader(receiveStream, Encoding.GetEncoding(response.CharacterSet));

        string data = readStream.ReadToEnd();
        readStream.Close();

        msgBox.Text = data;
    }
}

However, the page I am trying to read first shows a temporary loading page. How can I work around this so that the HTML is captured after the real page has actually loaded?

Best regards

Upvotes: 0

Views: 1870

Answers (2)

user7226134

Reputation: 1

Why don't you use a WebBrowser control and add a delay with

await Task.Delay(n)
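A minimal sketch of that idea, assuming a WinForms project with a WebBrowser control (the method name and the 3-second delay are illustrative, not from the question; the right delay depends on how long the site's loader runs):

```csharp
using System;
using System.Threading.Tasks;
using System.Windows.Forms;

// Sketch only: navigate, wait for the initial document, then wait a bit
// longer so the loader page's scripts can swap in the real content.
static async Task<string> GetRenderedHtmlAsync(WebBrowser browser, string url)
{
    var loaded = new TaskCompletionSource<bool>();
    browser.DocumentCompleted += (s, e) => loaded.TrySetResult(true);

    browser.Navigate(url);
    await loaded.Task;           // the initial (loader) document has finished loading
    await Task.Delay(3000);      // assumed delay; tune to the site's loader

    return browser.DocumentText; // HTML as it stands after scripts have run
}
```

Unlike HttpWebRequest, the WebBrowser control executes JavaScript, which is why the delay approach can work when the loader page builds the real content client-side.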

Upvotes: 0

David

Reputation: 218798

the page I am trying to read has a temporary loader page

It all depends on what that means and how that "temporary loader page" works. For example, if that page (whether via JavaScript code or an HTML META redirect) makes a request to the destination page, then that request is what you need to capture. Currently you're reading from a given URL:

(HttpWebRequest)WebRequest.Create(url)

This is essentially making a GET request to that URL and reading the response. But based on your description it sounds like that's the wrong URL. It sounds like there's a second URL which contains the actual information you're looking for.

Given that, you essentially have two options:

  1. Determine what that other URL is manually, by visiting the page and inspecting the requests in your browser's developer tools, and use that as the value of url in your code.
  2. Determine how that other URL is itself produced by the page code of the first URL (is it embedded in the page source somewhere?), parse it out of the response you get from the first url value, and make a second request to the new URL.

Clearly the first option is a lot easier. The second is only necessary if that second URL changes with each visit or is expected to change frequently over time. If that's the case then you'd have to basically reverse-engineer how the website is performing the second request so you can perform it as well.
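As a concrete sketch of the second option, suppose the loader page embeds its destination in a meta-refresh tag (an assumption; the actual site might use JavaScript instead, in which case you'd parse whatever the script uses). The class and method names here are illustrative:

```csharp
using System;
using System.Text.RegularExpressions;

class LoaderPageScraper
{
    // Pulls the redirect target out of a meta-refresh tag, e.g.
    // <meta http-equiv="refresh" content="2; url=/real-page">
    public static string ExtractRefreshUrl(string html)
    {
        var match = Regex.Match(html,
            @"http-equiv\s*=\s*[""']refresh[""'][^>]*url\s*=\s*([^""'>\s]+)",
            RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[1].Value : null;
    }

    static void Main()
    {
        // In the question's code you would run this on the `data` string from
        // the first request, then issue a second HttpWebRequest to the result.
        string loaderHtml =
            "<meta http-equiv=\"refresh\" content=\"2; url=https://example.com/real\">";
        Console.WriteLine(ExtractRefreshUrl(loaderHtml));
    }
}
```

If the URL is relative (as in /real-page), combine it with the first URL via `new Uri(new Uri(url), secondUrl)` before making the second request.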

Web scraping can get complicated pretty quickly, and often turns into a game of cat and mouse (even unintentionally and mutually unaware) between the person scraping the content and the person hosting the content (who might not want it to be scraped).

Upvotes: 2
