user472507

Reputation: 75

Bulk download of web pages using Qt

I want to write a program using Qt that downloads a large number of HTML web pages, about 5000, from one site every day. After downloading those pages I need to extract some data with DOM queries, using the WebKit module, and then store that data in a database.

What is the best/correct/efficient way to do that, in particular for the download and analysis phases? How do I handle that number of requests, and how do I create the "download manager"?

Upvotes: 3

Views: 1183

Answers (2)

Jeffrey Holmes

Reputation: 367

This has already been answered, but here is a solution using what you asked for: doing it with Qt.

You can make a website crawler using Qt (specifically QNetworkAccessManager, QNetworkRequest and QNetworkReply). I'm not sure this is exactly the proper way to handle such a task, but I found that utilizing multiple threads maximizes efficiency and saves time. (Please tell me if there is another way, or confirm whether this is good practice.)

The concept is that a list of work is queued, a worker performs the work, and after receiving the HTML it processes the information and then continues on to the next item.

Worker class: the worker object should accept a URL, download that URL's HTML data, and then process the information when it is received.

Create a queue and a manager for the queue: I created a QQueue<checkNewArrivalWorker*> workQueue to hold the list of tasks, plus a counter to control the number of items being processed concurrently.

    //First, declare these somewhere (e.g. as members of the manager class)
    QQueue<checkNewArrivalWorker*> workQueue; //Queue of pending workers
    int maxWorkers = 10;    //Maximum number of concurrent workers
    int currentWorkers = 0; //Number of workers currently running


    //Then create the workers
    void downloadNewArrivals::createWorkers(QString url){
        checkNewArrivalWorker* worker = new checkNewArrivalWorker(url);
        workQueue.enqueue(worker);
    }

    //Make a function to control the number of workers,
    //and process the next items after workers finish

    void downloadNewArrivals::processWorkQueue(){
        if (workQueue.isEmpty() && currentWorkers == 0){
            qDebug() << "Work Queue Empty";
        } else {
            //Start workers in separate threads until maxWorkers are running
            while (currentWorkers < maxWorkers && !workQueue.isEmpty()){
                QThread* thread = new QThread;
                checkNewArrivalWorker* worker = workQueue.dequeue();
                worker->moveToThread(thread);
                connect(worker, SIGNAL(error(QString)), this, SLOT(errorString(QString)));
                connect(thread, SIGNAL(started()), worker, SLOT(process()));
                connect(worker, SIGNAL(finished()), thread, SLOT(quit()));
                connect(worker, SIGNAL(finished()), worker, SLOT(deleteLater()));
                connect(thread, SIGNAL(finished()), this, SLOT(reduceThreadCounterAndProcessNext()));
                connect(thread, SIGNAL(finished()), thread, SLOT(deleteLater()));
                thread->start();
                currentWorkers++;
            }
        }
    }

    //When a worker finishes, process the next one
    void downloadNewArrivals::reduceThreadCounterAndProcessNext(){
        currentWorkers--; //This counter enforces the maxWorkers limit

        processWorkQueue();
    }
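
To tie these pieces together, the manager could be driven roughly like this (a small sketch; the start() function and its name are my assumption, not part of the code above):

    //Hypothetical driver: fill the queue, then kick off processing
    void downloadNewArrivals::start(const QStringList &urls){
        foreach (const QString &url, urls)
            createWorkers(url);
        processWorkQueue();
    }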


    //Now the worker
    //The important parts of the worker class...
    void checkNewArrivalWorker::getPages(QString url){
        //Created on the heap; cleaned up via deleteLater below
        QNetworkAccessManager *manager = new QNetworkAccessManager(this);
        QNetworkRequest getPageRequest = QNetworkRequest(QUrl(url));
        getPageRequest.setRawHeader("User-Agent", "Mozilla/5.0 (X11; U; Linux i686 (x86_64); "
                                    "en-US; rv:1.9.0.1) Gecko/2008070206 Firefox/3.0.1");
        getPageRequest.setRawHeader("charset", "utf-8");
        getPageRequest.setRawHeader("Connection", "keep-alive");
        connect(manager, SIGNAL(finished(QNetworkReply*)), this, SLOT(replyGetPagesFinished(QNetworkReply*)));
        connect(manager, SIGNAL(finished(QNetworkReply*)), manager, SLOT(deleteLater()));
        manager->get(getPageRequest);
    }

    void checkNewArrivalWorker::replyGetPagesFinished(QNetworkReply *reply){
        QString data = reply->readAll(); //data now holds the HTML to process as needed...
        reply->deleteLater();
        emit finished();
    }
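
The worker's class declaration isn't shown above; given the signals and slots it uses, it might look roughly like this (a sketch, with the constructor and member names being my assumptions):

    #include <QObject>
    #include <QString>
    #include <QNetworkReply>

    class checkNewArrivalWorker : public QObject {
        Q_OBJECT
    public:
        explicit checkNewArrivalWorker(const QString &url, QObject *parent = 0)
            : QObject(parent), m_url(url) {} //leave parentless when using moveToThread()

    public slots:
        void process() { getPages(m_url); } //entry point, invoked when the thread starts
        void replyGetPagesFinished(QNetworkReply *reply); //defined above

    signals:
        void finished(); //tells the manager to start the next worker
        void error(QString err);

    private:
        void getPages(QString url); //defined above
        QString m_url;
    };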

After you get your information: I just processed it from a QString, but I'm sure you can work out how to use a DOM parser once you get to this stage.
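
Since the question asked about DOM queries with the WebKit module, here is a minimal sketch of that stage (the parseHtml function and the selector are illustrative assumptions). Note that QtWebKit classes generally have to live in the main (GUI) thread, so you may want to hand the HTML back to the main thread before parsing:

    #include <QWebPage>
    #include <QWebFrame>
    #include <QWebElement>
    #include <QWebSettings>

    //Hypothetical sketch: query the downloaded HTML with Qt WebKit's DOM API
    void parseHtml(const QString &html){
        QWebPage page;
        page.settings()->setAttribute(QWebSettings::JavascriptEnabled, false);
        page.mainFrame()->setHtml(html);

        //findAll() takes a CSS selector; "a" is just an example
        QWebElementCollection links = page.mainFrame()->documentElement().findAll("a");
        foreach (QWebElement link, links){
            QString href = link.attribute("href");
            QString text = link.toPlainText();
            //...store href/text in the database here...
        }
    }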

I hope this example is sufficient to help you.

Upvotes: 1

PiedPiper

Reputation: 5785

To download the pages, it makes sense to use a dedicated library like libcurl.
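
For completeness, a minimal sketch of fetching one page with libcurl might look like this (the helper names are mine; only the curl_easy_* calls are libcurl's API):

    #include <curl/curl.h>
    #include <string>

    //Collect the response body into a std::string
    static size_t writeToString(char *ptr, size_t size, size_t nmemb, void *userdata){
        std::string *out = static_cast<std::string*>(userdata);
        out->append(ptr, size * nmemb);
        return size * nmemb;
    }

    std::string fetchPage(const std::string &url){
        std::string body;
        CURL *curl = curl_easy_init();
        if (curl){
            curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
            curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, writeToString);
            curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
            curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
            curl_easy_perform(curl);
            curl_easy_cleanup(curl);
        }
        return body;
    }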

Upvotes: 2
