How can I use C# to save a webpage as a text file for later parsing

Question

I would like to load a page such as "http://finance.yahoo.com/q/ks?s=FORK+Key+Statistic" from C# and save it as a text file for later parsing or scraping. I know I can do this from the browser (Firefox in my case) by right-clicking the page, choosing "Save Page As...", and saving it as a text file; all the text with the data I need is then in that file. I would like to know how to automate this process from C#. I found this code on MSDN that automates printing a web page:

private void PrintHelpPage()
{
    // Create a WebBrowser instance. 
    WebBrowser webBrowserForPrinting = new WebBrowser();

    // Add an event handler that prints the document after it loads.
    webBrowserForPrinting.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(PrintDocument);

    // Set the Url property to load the document.
    webBrowserForPrinting.Url = new Uri(@"\\myshare\help.html");
}

private void PrintDocument(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // Print the document now that it is fully loaded.
    ((WebBrowser)sender).Print();

    // Dispose the WebBrowser now that the task is complete. 
    ((WebBrowser)sender).Dispose();
}

This works, except that only the page header is printed. Does anyone know of a way to do approximately the same thing with something like the browser's Save or "Save Page As..." command? I have also tried other options such as HtmlAgilityPack, WebClient, and HttpClient. These all return the HTML source, which does not contain any of the data shown on the rendered page. If I could find out how to get the IDs of the data elements on the page, that might also be useful.

I finally got it to work (see code below):

        WebBrowser browser = new WebBrowser();
        browser.ScriptErrorsSuppressed = true;
        int j = 0;
        label1.Text = j.ToString();
        label1.Refresh();
        int SleepTime = 3000;
        loadPage: browser.Navigate("http://finance.yahoo.com/q/ks?s=GBX+Key+Statistic");
        System.Threading.Thread.Sleep(SleepTime);
        MessageBox.Show("browser.Navigae OK"); //Why is MessageBox needed here???
        label1.Refresh();
        if (browser.ReadyState == WebBrowserReadyState.Complete)
        {
             // It's done!
            string path = @"C:\VS2015Projects\C#\caoStocksCS\textFiles\somefile13.txt";
            //MessageBox.Show("path OK");
            if (browser.Document.Body.Parent.InnerText != null)
            {
                File.WriteAllText(path, browser.Document.Body.Parent.InnerText, Encoding.GetEncoding(browser.Document.Encoding));
                MessageBox.Show("Success! somefile13.txt created");
            }
            else
            {
                MessageBox.Show("browser.Document.Body.Parent.InnerText=" + browser.Document.Body.Parent.InnerText);
                MessageBox.Show("Failure somefile13.txt not created");
            }
        }
        else
        {
            SleepTime += SleepTime;
            ++j;
            label1.Text = j.ToString();
            goto loadPage;
        }

But it is not fully automated: the call MessageBox.Show("browser.Navigae OK"), or some other message box, is needed at that point; without it, the code just keeps looping.
Does anyone know why the MessageBox is needed? Doesn't it just pause the program until it is clicked or dismissed? Is there any way to get the same effect without showing a message box?
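
A likely explanation for the behaviour described above: the WinForms WebBrowser control only makes progress while the UI thread's message loop is running. Thread.Sleep blocks that thread, so navigation never finishes, whereas MessageBox.Show runs its own modal message loop and lets the page load while the box is open. The following is a minimal sketch of one way to avoid both the sleep and the message box, by awaiting the DocumentCompleted event. The names PageTextSaver, LoadPageTextAsync and SavePageForm, the temp-folder output path, and the choice of UTF-8 are assumptions made for illustration; they are not part of the original code.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;

public static class PageTextSaver
{
    // Starts navigation and returns a task that completes once the browser
    // reports ReadyState == Complete. Call this from the UI (STA) thread.
    public static Task<string> LoadPageTextAsync(WebBrowser browser, string url)
    {
        var tcs = new TaskCompletionSource<string>();
        WebBrowserDocumentCompletedEventHandler handler = null;
        handler = (sender, e) =>
        {
            // DocumentCompleted can fire once per frame; wait for the whole page.
            if (browser.ReadyState != WebBrowserReadyState.Complete) return;
            browser.DocumentCompleted -= handler;
            HtmlElement root = browser.Document?.Body?.Parent;
            tcs.TrySetResult(root?.InnerText ?? string.Empty);
        };
        browser.DocumentCompleted += handler;
        browser.Navigate(url);
        return tcs.Task;
    }
}

public class SavePageForm : Form
{
    public SavePageForm()
    {
        // Awaiting hands control back to the message loop instead of
        // blocking it the way Thread.Sleep does.
        Shown += async (sender, e) =>
        {
            // Hypothetical output path, chosen only for this sketch.
            string path = Path.Combine(Path.GetTempPath(), "somefile.txt");
            using (var browser = new WebBrowser { ScriptErrorsSuppressed = true })
            {
                string text = await PageTextSaver.LoadPageTextAsync(
                    browser, "http://finance.yahoo.com/q/ks?s=GBX+Key+Statistic");
                File.WriteAllText(path, text, Encoding.UTF8);
            }
            Text = "Saved " + path;
        };
    }

    [STAThread]
    public static void Main()
    {
        Application.Run(new SavePageForm());
    }
}
```

Note that DocumentCompleted only signals that the document itself has loaded. If the page fills in its data from script afterwards, the saved text may still be incomplete and some additional wait (for example, a timer that checks for a known element) might be needed.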
