DonkeyKong

Reputation: 1091

How to interact with internet explorer C++

I have a school project I'm working on; the outcome itself seems pointless, but I believe the real point is the experience gained along the way. What I am trying to do is submit an initial URL, pull all the URLs on that page, visit them in order, and keep doing this until I tell it to stop. All of the URLs will be recorded in a text file. So far, I am able to open a window in IE and launch a webpage of my choosing. Now I need to know how to send IE to a new webpage using the same session, and also how I can scan and pull data from the websites I visit. Thanks for any help!

Here is my code so far:

#include <string>
#include <iostream>
#include <windows.h>
#include <stdio.h>
#include <tchar.h>

using namespace std;

int main( int argc, char *argv[] )
{
    std::string uRL, prog;
    int length, count;

    STARTUPINFOA si;    // ANSI struct to match CreateProcessA below
    PROCESS_INFORMATION pi;

    ZeroMemory( &si, sizeof(si) );
    si.cb = sizeof(si);
    ZeroMemory( &pi, sizeof(pi) );

    //if( argc != 2 )
    //{
    //    printf("Usage: %s [cmdline]\n", argv[0]);
    //    system("PAUSE");
    //    return 0;
    //}

    std::cout << "Enter URL: ";
    std::cin >> uRL;

    prog = ("C:\\Program Files\\Internet Explorer\\iexplore.exe ") + uRL;

    char *cstr = new char[prog.length() + 1];
    strcpy(cstr, prog.c_str());

    // Start the child process. 
    if( !CreateProcessA(NULL,  // No module name (use command line)
        cstr,           // Command line (char-based, so use the ANSI CreateProcessA)
        NULL,           // Process handle not inheritable
        NULL,           // Thread handle not inheritable
        FALSE,          // Set handle inheritance to FALSE
        0,              // No creation flags
        NULL,           // Use parent's environment block
        NULL,           // Use parent's starting directory 
        &si,            // Pointer to STARTUPINFO structure
        &pi )           // Pointer to PROCESS_INFORMATION structure
    ) 
    {
        printf( "CreateProcess failed (%d).\n", GetLastError() );
        system("PAUSE");
        return 0;
    }

    //cout << HRESULT get_Count(long *Count) << endl;  // not valid C++ - this is just a COM method signature

    //cout << count << endl;

    system("PAUSE");

    // Wait until child process exits.
    WaitForSingleObject( pi.hProcess, INFINITE );

    // Close process and thread handles. 
    CloseHandle( pi.hProcess );
    CloseHandle( pi.hThread );

    delete [] cstr;

    return 0;
}

Upvotes: 0

Views: 5720

Answers (2)

Captain Obvlious

Reputation: 20103

If you want to crawl a webpage, launching Internet Explorer is not going to work very well. I also don't recommend attempting to parse the HTML page yourself unless you are prepared for a lot of heartache and hassle. Instead I recommend that you create an instance of an IWebBrowser2 object and use it to navigate to the webpage, grab the appropriate IHTMLDocument2 object, and iterate through the elements picking out the URLs. It's far easier and is a common approach using components that are already installed on Windows. The example below should get you started and on your way to crawling the web like a proper spider should.

#include <comutil.h>    // _variant_t
#include <mshtml.h>     // IHTMLDocument and IHTMLElement
#include <exdisp.h>     // IWebBrowser2
#include <atlbase.h>    // CComPtr
#include <string>
#include <iostream>
#include <vector>

// Make sure we link in the support library!
#pragma comment(lib, "comsuppw.lib")


// Load a webpage
HRESULT LoadWebpage(
    const CComBSTR& webpageURL,
    CComPtr<IWebBrowser2>& browser,
    CComPtr<IHTMLDocument2>& document)
{
    HRESULT hr;
    VARIANT empty;

    VariantInit(&empty);

    // Navigate to the specified webpage
    hr = browser->Navigate(webpageURL, &empty, &empty, &empty, &empty);

    //  Wait for the load to finish (note: this polls the ready state in a tight loop).
    if(SUCCEEDED(hr))
    {
        READYSTATE state;

        while(SUCCEEDED(hr = browser->get_ReadyState(&state)))
        {
            if(state == READYSTATE_COMPLETE) break;
        }
    }

    // The browser now has a document object. Grab it.
    if(SUCCEEDED(hr))
    {
        CComPtr<IDispatch> dispatch;

        hr = browser->get_Document(&dispatch);
        if(SUCCEEDED(hr) && dispatch != NULL)
        {
            hr = dispatch.QueryInterface<IHTMLDocument2>(&document);
        }
        else
        {
            hr = E_FAIL;
        }
    }

    return hr;
}


void CrawlWebsite(const CComBSTR& webpage, std::vector<std::wstring>& urlList)
{
    HRESULT hr;

    // Create a browser object
    CComPtr<IWebBrowser2> browser;
    hr = CoCreateInstance(
        CLSID_InternetExplorer,
        NULL,
        CLSCTX_SERVER,
        IID_IWebBrowser2,
        reinterpret_cast<void**>(&browser));

    // Grab a web page
    CComPtr<IHTMLDocument2> document;
    if(SUCCEEDED(hr))
    {
        // Make sure these two items are scoped so CoUninitialize doesn't gum
        // us up.
        hr = LoadWebpage(webpage, browser, document);
    }

    // Grab all the anchors!
    if(SUCCEEDED(hr))
    {
        CComPtr<IHTMLElementCollection> urls;
        long count = 0;

        hr = document->get_all(&urls);

        if(SUCCEEDED(hr))
        {
            hr = urls->get_length(&count);
        }

        if(SUCCEEDED(hr))
        {
            for(long i = 0; i < count; i++)
            {
                CComPtr<IDispatch>  element;
                CComPtr<IHTMLAnchorElement> anchor;

                // Get an IDispatch interface for the next option.
                _variant_t index = i;
                hr = urls->item( index, index, &element);
                if(SUCCEEDED(hr))
                {
                    hr = element->QueryInterface(
                        IID_IHTMLAnchorElement, 
                        reinterpret_cast<void **>(&anchor));
                }

                if(SUCCEEDED(hr) && anchor != NULL)
                {
                    CComBSTR    url;
                    hr = anchor->get_href(&url);
                    if(SUCCEEDED(hr) && url != NULL)
                    {
                        urlList.push_back(std::wstring(url));
                    }
                }
            }
        }
    }
}

int main()
{
    HRESULT hr;

    hr = CoInitialize(NULL);
    std::vector<std::wstring>   urls;

    CComBSTR webpage(L"http://cppreference.com");


    CrawlWebsite(webpage, urls);
    for(std::vector<std::wstring>::iterator it = urls.begin();
        it != urls.end();
        ++it)
    {
        std::wcout << "URL: " << *it << std::endl;

    }

    CoUninitialize();

    return 0;
}
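The question also mentions recording every URL to a text file and then visiting the discovered links in turn. Here is a minimal sketch of that outer loop, reusing the CrawlWebsite function above; the CrawlAndLog name, the urls.txt file name, and the maxPages limit are placeholders I've made up for illustration, not part of any existing API.

#include <fstream>
#include <deque>
#include <set>
#include <string>
#include <vector>

// Hypothetical driver: breadth-first crawl that logs every URL it visits.
// Assumes CoInitialize has already been called, as in main() above.
void CrawlAndLog(const std::wstring& startUrl, std::size_t maxPages)
{
    std::wofstream log("urls.txt");               // placeholder file name
    std::deque<std::wstring> queue(1, startUrl);  // URLs still to visit
    std::set<std::wstring> seen;                  // URLs already visited

    while(!queue.empty() && seen.size() < maxPages)
    {
        std::wstring current = queue.front();
        queue.pop_front();

        if(!seen.insert(current).second) continue; // skip duplicates
        log << current << L"\n";

        // Collect the links on this page and queue them for later visits.
        std::vector<std::wstring> found;
        CrawlWebsite(CComBSTR(current.c_str()), found);
        queue.insert(queue.end(), found.begin(), found.end());
    }
}

Note that each call to CrawlWebsite creates a fresh browser instance, so this does not keep the "same session" the question asks about; for that you would hoist the IWebBrowser2 object out of CrawlWebsite and call Navigate on it repeatedly. It is also a good idea to call browser->Quit() before the CComPtr goes out of scope so stray iexplore.exe processes don't pile up.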

Upvotes: 1

Nathan

Reputation: 78549

To scan and pull data from the websites, you'll want to capture the HTML and iterate through it looking for all character sequences matching a certain pattern. Have you ever used regular expressions? Regular expressions would be by far the best fit here, but once you understand them (just look up a tutorial on the basics), you can also apply the same pattern-recognition concepts manually in this project.

So what you're looking for is something like http(s)://... It's more complex than that, though, because domain names follow a rather intricate pattern. You'll probably want to use a third-party HTML parser or regular expression library, but it's doable without one, although pretty tedious to program.

Here's a link about regular expressions in C++: http://www.johndcook.com/cpp_regex.html
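For illustration, here is a minimal sketch of that idea using the C++11 <regex> header. The ExtractUrls name and the pattern are just examples I've chosen; the pattern is deliberately crude and will miss or mangle plenty of real-world URLs, which is exactly why a dedicated HTML parser is the safer route.

#include <regex>
#include <string>
#include <vector>
#include <iostream>

// Pull every http(s)://... looking token out of a blob of HTML.
// The pattern below is a rough example, not a complete URL grammar.
std::vector<std::string> ExtractUrls(const std::string& html)
{
    std::vector<std::string> urls;
    // Scheme, then everything up to whitespace, a quote, or an angle bracket.
    std::regex pattern(R"(https?://[^\s"'<>]+)");

    for(std::sregex_iterator it(html.begin(), html.end(), pattern), end;
        it != end; ++it)
    {
        urls.push_back(it->str());
    }
    return urls;
}

int main()
{
    std::string page = "<a href=\"http://example.com/page\">link</a>";
    std::vector<std::string> found = ExtractUrls(page);

    for(std::size_t i = 0; i < found.size(); ++i)
    {
        std::cout << found[i] << '\n';
    }
}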

Upvotes: 0
