Reputation: 127
I have to get lots of filenames from inside a webserver's htdocs directory and then search a huge number of archived logfiles for the last access to each of those files.
I plan to do this in C++ with Boost. I would take the newest log first and read it backwards, checking every single line against all of the filenames I have.
If a filename matches, I read the time from the log line and save it as that file's last access. That file can then be dropped from the search, since I only want to know the last access.
The vector of filenames to search for should shrink rapidly.
I wonder how to handle this kind of problem most effectively with multiple threads.
Do I partition the logfiles and let every thread search a part of the logs in memory, removing a filename from the shared vector whenever a thread finds a match? Or is there a more effective way to do this?
Upvotes: 1
Views: 198
Reputation: 127
OK, this was some days ago already, but since then I have spent time writing code and working with SQLite in other projects.
I still wanted to compare the DB approach with the mmap solution, purely for the performance aspect.
Of course it saves you a lot of work if you can use SQL queries to handle all the data you parsed. But I really didn't mind the extra work, because I'm still learning a lot, and what I learned from this is:
The mmap approach - if you implement it correctly - is absolutely superior in performance. It is unbelievably fast, which you will notice if you implement the word-count example, which can be seen as the "hello world" of MapReduce algorithms.
If you additionally want to benefit from the SQL language, the correct approach would be to implement your own SQL wrapper that also uses a kind of map-reduce, by sharing queries amongst threads.
You could, for instance, share objects by ID amongst threads, where every thread handles its own DB connection and queries objects in its own part of the dataset.
This would be much faster than just writing things to an SQLite DB in the usual way.
After all, you can say:
mmap is the fastest way to handle string processing.
SQL provides great functionality for parser applications, but it slows things down if you don't implement a wrapper for processing the SQL queries.
Upvotes: 0
Reputation: 983
Try using mmap, it will save you considerable hair loss. I was feeling expeditious and in some odd mood to recall my mmap knowledge, so I wrote a simple thing to get you started. Hope this helps!
The beauty of mmap is that it can be easily parallelized with OpenMP. It's also a really good way to prevent an I/O bottleneck. Let me first define the Logfile class and then I'll go over implementation.
Here's the header file (logfile.h)
#ifndef LOGFILE_H
#define LOGFILE_H

#include <fcntl.h>
#include <string>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

using std::string;

class Logfile {
public:
    Logfile(string name);
    char* open();
    unsigned int get_size() const;
    string get_name() const;
    bool close();

private:
    string name;
    char* start;
    unsigned int size;
    int file_descriptor;
};

#endif // LOGFILE_H
And here's the .cpp file.
#include <iostream>

#include "logfile.h"

using namespace std;

Logfile::Logfile(string name) {
    this->name = name;
    start = NULL;
    size = 0;
    file_descriptor = -1;
}

char* Logfile::open() {
    // get file size
    struct stat st;
    if (stat(name.c_str(), &st) < 0) {
        cerr << "Error stat-ing file: " << name << endl;
        return NULL;
    }
    size = st.st_size;

    // get file descriptor; "::" calls the POSIX open(), not this method
    file_descriptor = ::open(name.c_str(), O_RDONLY);
    if (file_descriptor < 0) {
        cerr << "Error obtaining file descriptor for: " << name << endl;
        return NULL;
    }

    // memory-map the whole file, read-only
    start = (char*) mmap(NULL, size, PROT_READ, MAP_SHARED, file_descriptor, 0);
    if (start == MAP_FAILED) {
        cerr << "Error memory-mapping the file\n";
        ::close(file_descriptor);
        start = NULL;
        return NULL;
    }
    return start;
}

unsigned int Logfile::get_size() const {
    return size;
}

string Logfile::get_name() const {
    return name;
}

bool Logfile::close() {
    if (start == NULL) {
        cerr << "Error closing file. Was close() called without a matching open()?\n";
        return false;
    }
    // unmap the memory and close the file descriptor
    bool ret = munmap(start, size) != -1 && ::close(file_descriptor) != -1;
    start = NULL;
    return ret;
}
Now, using this code, you can use OpenMP to work-share the parsing of these logfiles, e.g.:
Logfile lf("yourfile");
char* log = lf.open();
int size = (int) lf.get_size();
int i;

#pragma omp parallel shared(log, size) private(i)
{
    #pragma omp for
    for (i = 0; i < size; i++) {
        // do your routine
    }

    #pragma omp critical
    {
        // some method that combines the thread results
    }
}
Upvotes: 1
Reputation: 53861
Parse the logfile into a database table (SQLite ftw). One of the fields will be the path.
In another table, add the files you are looking for.
Now it is a simple join on a derived table, something like this:
SELECT l.file, l.last_access FROM toFind f
LEFT JOIN (
SELECT file, max(last_access) as last_access from logs group by file
) as l ON f.file = l.file
All the files in toFind will be there, and will have last_access NULL for those not found in the logs.
Upvotes: 1