DonkeyKong

Reputation: 1091

How to interact with internet explorer C++

I have a school project I'm working on; the outcome itself seems pointless, but I believe the real point is the experience gained along the way. What I am trying to do is submit an initial URL, pull all the URLs on that page, visit them in order, and keep doing this until I tell it to stop. All of the URLs will be recorded in a text file. So far, I am able to open a window in IE and launch a webpage of my choosing. Now I need to know how to send IE to a new webpage using the same session, and also how I can scan and pull data from the websites I visit. Thanks for any help!

Here is my code so far:

#include <string>
#include <iostream>
#include <windows.h>
#include <stdio.h>
#include <tchar.h>

using namespace std;

int main( int argc, char *argv[] )
{
    std::string uRL, prog;
    int length, count;

    STARTUPINFOA si;    // ANSI struct to match CreateProcessA below
    PROCESS_INFORMATION pi;

    ZeroMemory( &si, sizeof(si) );
    si.cb = sizeof(si);
    ZeroMemory( &pi, sizeof(pi) );

    //if( argc != 2 )
    //{
    //    printf("Usage: %s [cmdline]\n", argv[0]);
    //    system("PAUSE");
    //    return 0;
    //}

    std::cout << "Enter URL: ";
    std::cin >> uRL;

    prog = ("C:\\Program Files\\Internet Explorer\\iexplore.exe ") + uRL;

    char *cstr = new char[prog.length() + 1];
    strcpy(cstr, prog.c_str());

    // Start the child process. 
    if( !CreateProcessA(NULL,  // No module name (use command line)
        cstr,           // Command line (char-based, so use the ANSI CreateProcessA)
        NULL,           // Process handle not inheritable
        NULL,           // Thread handle not inheritable
        FALSE,          // Set handle inheritance to FALSE
        0,              // No creation flags
        NULL,           // Use parent's environment block
        NULL,           // Use parent's starting directory 
        &si,            // Pointer to STARTUPINFO structure
        &pi )           // Pointer to PROCESS_INFORMATION structure
    ) 
    {
        printf( "CreateProcess failed (%d).\n", GetLastError() );
        system("PAUSE");
        return 0;
    }

    //cout << HRESULT get_Count(long *Count) << endl;  // not valid C++ - this is just a COM method signature

    //cout << count << endl;

    system("PAUSE");

    // Wait until child process exits.
    WaitForSingleObject( pi.hProcess, INFINITE );

    // Close process and thread handles. 
    CloseHandle( pi.hProcess );
    CloseHandle( pi.hThread );

    delete [] cstr;

    return 0;
}

Upvotes: 0

Views: 5720

Answers (2)

Captain Obvlious

Reputation: 20103

If you want to crawl a webpage, launching Internet Explorer is not going to work very well. I also don't recommend attempting to parse the HTML page yourself unless you are prepared for a lot of heartache and hassle. Instead I recommend that you create an instance of an IWebBrowser2 object and use it to navigate to the webpage, grab the appropriate IHTMLDocument2 object, and iterate through the elements picking out the URLs. It's far easier and is a common approach using components that are already installed on Windows. The example below should get you started and on your way to crawling the web like a proper spider should.

#include <comutil.h>    // _variant_t
#include <mshtml.h>     // IHTMLDocument and IHTMLElement
#include <exdisp.h>     // IWebBrowser2
#include <atlbase.h>    // CComPtr
#include <string>
#include <iostream>
#include <vector>

// Make sure we link in the support library!
#pragma comment(lib, "comsuppw.lib")


// Load a webpage
HRESULT LoadWebpage(
    const CComBSTR& webpageURL,
    CComPtr<IWebBrowser2>& browser,
    CComPtr<IHTMLDocument2>& document)
{
    HRESULT hr;
    VARIANT empty;

    VariantInit(&empty);

    // Navigate to the specified webpage
    hr = browser->Navigate(webpageURL, &empty, &empty, &empty, &empty);

    //  Wait for the load to finish (note: this polls the ready state in a tight loop).
    if(SUCCEEDED(hr))
    {
        READYSTATE state;

        while(SUCCEEDED(hr = browser->get_ReadyState(&state)))
        {
            if(state == READYSTATE_COMPLETE) break;
        }
    }

    // The browser now has a document object. Grab it.
    if(SUCCEEDED(hr))
    {
        CComPtr<IDispatch> dispatch;

        hr = browser->get_Document(&dispatch);
        if(SUCCEEDED(hr) && dispatch != NULL)
        {
            hr = dispatch.QueryInterface<IHTMLDocument2>(&document);
        }
        else
        {
            hr = E_FAIL;
        }
    }

    return hr;
}


void CrawlWebsite(const CComBSTR& webpage, std::vector<std::wstring>& urlList)
{
    HRESULT hr;

    // Create a browser object
    CComPtr<IWebBrowser2> browser;
    hr = CoCreateInstance(
        CLSID_InternetExplorer,
        NULL,
        CLSCTX_SERVER,
        IID_IWebBrowser2,
        reinterpret_cast<void**>(&browser));

    // Grab a web page
    CComPtr<IHTMLDocument2> document;
    if(SUCCEEDED(hr))
    {
        // Make sure these two items are scoped so CoUninitialize doesn't gum
        // us up.
        hr = LoadWebpage(webpage, browser, document);
    }

    // Grab all the anchors!
    if(SUCCEEDED(hr))
    {
        CComPtr<IHTMLElementCollection> urls;
        long count = 0;

        hr = document->get_all(&urls);

        if(SUCCEEDED(hr))
        {
            hr = urls->get_length(&count);
        }

        if(SUCCEEDED(hr))
        {
            for(long i = 0; i < count; i++)
            {
                CComPtr<IDispatch>  element;
                CComPtr<IHTMLAnchorElement> anchor;

                // Get an IDispatch interface for the next option.
                _variant_t index = i;
                hr = urls->item( index, index, &element);
                if(SUCCEEDED(hr))
                {
                    hr = element->QueryInterface(
                        IID_IHTMLAnchorElement, 
                        reinterpret_cast<void **>(&anchor));
                }

                if(SUCCEEDED(hr) && anchor != NULL)
                {
                    CComBSTR    url;
                    hr = anchor->get_href(&url);
                    if(SUCCEEDED(hr) && url != NULL)
                    {
                        urlList.push_back(std::wstring(url));
                    }
                }
            }
        }
    }
}

int main()
{
    HRESULT hr;

    hr = CoInitialize(NULL);
    std::vector<std::wstring>   urls;

    CComBSTR webpage(L"http://cppreference.com");


    CrawlWebsite(webpage, urls);
    for(std::vector<std::wstring>::iterator it = urls.begin();
        it != urls.end();
        ++it)
    {
        std::wcout << "URL: " << *it << std::endl;

    }

    CoUninitialize();

    return 0;
}
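The question also mentions recording every URL to a text file and then visiting the discovered links in turn. Here is a minimal sketch of that outer loop, reusing the CrawlWebsite function above; the CrawlAndLog name, the urls.txt file name, and the maxPages limit are placeholders I've made up for illustration, not part of any existing API.

#include <fstream>
#include <deque>
#include <set>
#include <string>
#include <vector>

// Hypothetical driver: breadth-first crawl that logs every URL it visits.
// Assumes CoInitialize has already been called, as in main() above.
void CrawlAndLog(const std::wstring& startUrl, std::size_t maxPages)
{
    std::wofstream log("urls.txt");               // placeholder file name
    std::deque<std::wstring> queue(1, startUrl);  // URLs still to visit
    std::set<std::wstring> seen;                  // URLs already visited

    while(!queue.empty() && seen.size() < maxPages)
    {
        std::wstring current = queue.front();
        queue.pop_front();

        if(!seen.insert(current).second) continue; // skip duplicates
        log << current << L"\n";

        // Collect the links on this page and queue them for later visits.
        std::vector<std::wstring> found;
        CrawlWebsite(CComBSTR(current.c_str()), found);
        queue.insert(queue.end(), found.begin(), found.end());
    }
}

Note that each call to CrawlWebsite creates a fresh browser instance, so this does not keep the "same session" the question asks about; for that you would hoist the IWebBrowser2 object out of CrawlWebsite and call Navigate on it repeatedly. It is also a good idea to call browser->Quit() before the CComPtr goes out of scope so stray iexplore.exe processes don't pile up.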

Upvotes: 1

Nathan

Reputation: 78549

To scan and pull data from the websites, you'll want to capture the HTML and iterate through it looking for all character sequences matching a certain pattern. Have you ever used regular expressions? Regular expressions would be by far the best fit here, but once you understand them (just look up a tutorial on the basics), you can also apply the same pattern-recognition concepts manually in this project.

So what you're looking for is something like http(s)://... It's more complex than that, though, because domain names follow a rather intricate pattern. You'll probably want to use a third-party HTML parser or regular expression library, but it's doable without one, although pretty tedious to program.

Here's a link about regular expressions in C++: http://www.johndcook.com/cpp_regex.html
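For illustration, here is a minimal sketch of that idea using the C++11 <regex> header. The ExtractUrls name and the pattern are just examples I've chosen; the pattern is deliberately crude and will miss or mangle plenty of real-world URLs, which is exactly why a dedicated HTML parser is the safer route.

#include <regex>
#include <string>
#include <vector>
#include <iostream>

// Pull every http(s)://... looking token out of a blob of HTML.
// The pattern below is a rough example, not a complete URL grammar.
std::vector<std::string> ExtractUrls(const std::string& html)
{
    std::vector<std::string> urls;
    // Scheme, then everything up to whitespace, a quote, or an angle bracket.
    std::regex pattern(R"(https?://[^\s"'<>]+)");

    for(std::sregex_iterator it(html.begin(), html.end(), pattern), end;
        it != end; ++it)
    {
        urls.push_back(it->str());
    }
    return urls;
}

int main()
{
    std::string page = "<a href=\"http://example.com/page\">link</a>";
    std::vector<std::string> found = ExtractUrls(page);

    for(std::size_t i = 0; i < found.size(); ++i)
    {
        std::cout << found[i] << '\n';
    }
}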

Upvotes: 0
