netsky

Reputation: 127

Searching for hundreds of patterns in huge logfiles

I have to collect lots of filenames from a webserver's htdocs directory and then search a huge amount of archived logfiles for the last access to each of these files.

I plan to do this in C++ with Boost. I would take the newest log first and read it backwards, checking every single line against all of the filenames I collected.

If a filename matches, I read the time from the log line and save it as that file's last access. After that I don't need to look for this file any more, since I only want to know the last access.

The vector of filenames to search for should shrink rapidly.

I wonder how to handle this kind of problem most effectively with multiple threads.

Should I partition the logfiles and let every thread search a part of the logs in memory, removing a filename from the shared vector whenever a thread finds a match? Or is there a more effective way to do this?
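Roughly, I picture something like this (just a sketch with placeholder names, ignoring the backwards reading and the actual timestamp parsing):

#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <unordered_set>
#include <vector>

std::unordered_set<std::string> remaining = {"index.html", "foo.png"};
std::unordered_map<std::string, std::string> last_access; // filename -> matching log line
std::mutex guard;

// Each thread scans one chunk of log lines; on a match it stores the
// line (the timestamp would be parsed out of it) and drops the filename
// from the shared set so nobody searches for it again. The lock is
// deliberately coarse-grained just to show the idea.
void scan_chunk(const std::vector<std::string>& lines) {
    for (const auto& line : lines) {
        std::lock_guard<std::mutex> lock(guard);
        for (auto it = remaining.begin(); it != remaining.end(); ) {
            if (line.find(*it) != std::string::npos) {
                last_access[*it] = line;
                it = remaining.erase(it);
            } else {
                ++it;
            }
        }
    }
}

int main() {
    std::vector<std::string> chunk1 = {"... GET /index.html ..."};
    std::vector<std::string> chunk2 = {"... GET /foo.png ..."};
    std::thread t1(scan_chunk, std::cref(chunk1));
    std::thread t2(scan_chunk, std::cref(chunk2));
    t1.join();
    t2.join();
}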

Upvotes: 1

Views: 198

Answers (3)

netsky

Reputation: 127

OK, this was some days ago already, but in the meantime I spent some time writing code and working with SQLite in other projects.

I still wanted to compare the DB approach with the mmap solution, just for the performance aspect.

Of course it saves you a lot of work if you can use SQL queries to handle all the data you parsed. But I didn't really care about the amount of work, because I'm still learning a lot, and what I learned from this is:

The mmap approach, if you implement it correctly, is absolutely superior in performance. It's unbelievably fast, which you will notice if you implement the "word count" example, which can be seen as the "hello world" of MapReduce algorithms.
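For reference, here's roughly what that word count looks like with mmap, single-threaded to keep it short (my own sketch, with error handling kept minimal):

#include <cctype>
#include <fcntl.h>
#include <iostream>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    int fd = ::open(argv[1], O_RDONLY);
    if (fd < 0) return 1;
    struct stat st;
    if (fstat(fd, &st) != 0) return 1;

    // map the whole file read-only: no read() calls, no buffering layer
    char* data = (char*) mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) return 1;

    long words = 0;
    bool in_word = false;
    for (off_t i = 0; i < st.st_size; i++) {
        bool space = std::isspace((unsigned char) data[i]);
        if (!space && !in_word) words++;   // a new word starts here
        in_word = !space;
    }
    std::cout << words << " words" << std::endl;

    munmap(data, st.st_size);
    ::close(fd);
    return 0;
}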

Now, if you further want to benefit from the SQL language, the right approach would be to implement your own SQL wrapper that uses a kind of MapReduce too, by sharing queries amongst threads.

You could perhaps partition objects by ID amongst threads, where every thread handles its own DB connection and queries objects only in its own part of the dataset.

This would be much faster than just writing things to the SQLite DB in the usual way; see the sketch below.
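A rough sketch of that idea with the SQLite C API (the objects table, the payload column and the ID ranges are all placeholders):

#include <sqlite3.h>
#include <thread>
#include <vector>

// Each worker opens its OWN connection and only touches its slice
// of the ID range, so the threads never share a sqlite3 handle.
void query_range(const char* db_path, int id_from, int id_to) {
    sqlite3* db = NULL;
    if (sqlite3_open(db_path, &db) != SQLITE_OK) { sqlite3_close(db); return; }

    const char* sql = "SELECT id, payload FROM objects WHERE id BETWEEN ? AND ?";
    sqlite3_stmt* stmt = NULL;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) == SQLITE_OK) {
        sqlite3_bind_int(stmt, 1, id_from);
        sqlite3_bind_int(stmt, 2, id_to);
        while (sqlite3_step(stmt) == SQLITE_ROW) {
            // process sqlite3_column_*(stmt, ...) in this thread only
        }
        sqlite3_finalize(stmt);
    }
    sqlite3_close(db);
}

int main() {
    const int n_threads = 4, ids_per_thread = 1000;
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; t++)
        workers.emplace_back(query_range, "objects.db",
                             t * ids_per_thread, (t + 1) * ids_per_thread - 1);
    for (auto& w : workers) w.join();
}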

All in all, you can say:

- mmap is the fastest way to handle string processing
- SQL provides great functionality for parser applications, but it slows things down if you don't implement a wrapper for processing the SQL queries

Upvotes: 0

Keshav Saharia

Reputation: 983

Try using mmap; it will save you considerable hair loss. I was feeling expeditious and in some odd mood to recall my mmap knowledge, so I wrote a simple thing to get you started. Hope this helps!

The beauty of mmap is that it can be easily parallelized with OpenMP. It's also a really good way to prevent an I/O bottleneck. Let me first define the Logfile class, and then I'll go over the implementation.

Here's the header file (logfile.h)

#ifndef _LOGFILE_H_
#define _LOGFILE_H_

#include <iostream>
#include <fcntl.h>
#include <stdio.h>
#include <string>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

using std::string;

class Logfile {

public:

    Logfile(string name);

    char* open();
    size_t get_size() const;
    string get_name() const;
    bool close();

private:

    string name;            // path of the logfile
    char* start;            // start of the mapped region
    size_t size;            // file size in bytes
    int file_descriptor;

};

#endif

And here's the .cpp file.

#include <iostream>
#include "logfile.h"

using namespace std;

Logfile::Logfile(string name){
    this->name = name;
    start = NULL;
    size = 0;
    file_descriptor = -1;

}

char* Logfile::open(){

    // get file size
    struct stat st;
    if(stat(name.c_str(), &st) < 0){
        cerr << "Error stat-ing file: " << name << endl;
        return NULL;
    }
    size = st.st_size;

    // get file descriptor (note the :: so we call the POSIX open,
    // not this member function recursively)
    file_descriptor = ::open(name.c_str(), O_RDONLY);
    if(file_descriptor < 0){
        cerr << "Error obtaining file descriptor for: " << name << endl;
        return NULL;
    }

    // memory-map the whole file read-only
    start = (char*) mmap(NULL, size, PROT_READ, MAP_SHARED, file_descriptor, 0);
    if(start == MAP_FAILED){   // mmap signals failure with MAP_FAILED, not NULL
        cerr << "Error memory-mapping the file\n";
        ::close(file_descriptor);
        start = NULL;
        return NULL;
    }

    return start;
}

size_t Logfile::get_size() const {
    return size;
}

string Logfile::get_name() const {
    return name;
}

bool Logfile::close(){

    if( start == NULL){
        cerr << "Error closing file. Was close() called without a matching open()?\n";
        return false;
    }

    // unmap memory and close the file (again :: for the POSIX close)
    bool ret = munmap(start, size) != -1 && ::close(file_descriptor) != -1;
    start = NULL;
    return ret;

}

Now, using this code, you can use OpenMP to work-share the parsing of these logfiles, i.e.

Logfile lf ("yourfile");
char * log = lf.open();
int size = (int) lf.get_size();
int i;

#pragma omp parallel shared(log, size) private(i)
{
  #pragma omp for
  for (i = 0 ; i < size ; i++) {
     // do your routine
  }
  #pragma omp critical
  {
     // some methods that combine the thread results
  }
}
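For instance, a trivial stand-in for the routine is counting log lines with a reduction; the real routine would scan each thread's byte range for your filenames instead:

long line_count = 0;
#pragma omp parallel for reduction(+:line_count)
for (int j = 0; j < size; j++) {
    if (log[j] == '\n') line_count++;   // each thread counts its own share
}
// after the loop, line_count holds the combined result of all threads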

Upvotes: 1

Byron Whitlock

Reputation: 53861

Parse the logfiles into a database table (SQLite ftw). One of the fields will be the file path.

In another table, add the files you are looking for.
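For concreteness, the two tables could look something like this (the column names are just an assumption):

CREATE TABLE logs   (file TEXT, last_access DATETIME);
CREATE TABLE toFind (file TEXT);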

Now it is a simple join on a derived table. Something like this:

SELECT f.file, l.last_access FROM toFind f
LEFT JOIN (
    SELECT file, MAX(last_access) AS last_access FROM logs GROUP BY file
) AS l ON f.file = l.file

All the files in toFind will be there, with a NULL last_access for those not found in the logs.

Upvotes: 1
