Reputation: 613
I have to solve the following question for my OS homework. I've done some work, but I'm not quite there yet. Help will be greatly appreciated.
Question
Your task is to create a multi-threaded document analyzer. Your program should be able to use a variable number of threads to process a supplied file and produce some statistics about it. The required numbers are:
• The number of words (found by counting the number of spaces)
• The number of letters (found by using the isalpha() function)
• The number of punctuation characters (found by using the ispunct() function)
An example run that would involve 4 threads is shown below:
$ ./docAnal 4 test.txt
Words : 1245
Letters : 24313
Punctuation : 87
The document should be evenly split between the required threads. You should not hard-code your program parameters. They should be read from the command line as in the example shown above.
This is my code so far
#include <QThread>
#include <iostream>
#include <fstream>
#include <string>
#include <cstring>   // strlen
#include <cctype>    // ispunct
using namespace std;

char buff[200];

class MyThread : public QThread
{
private:
    int word = 0, punc = 0, countl = 0;
    int ID;
public:
    MyThread(int i) : ID(i) {}
    void run()
    {
        ifstream myfile;
        myfile.open("example.txt");
        cout << "Reading file" << endl;
        // read whitespace-separated tokens, counting them and their lengths
        while (!myfile.eof())
        {
            myfile >> buff;
            word++;
            countl = countl + strlen(buff);
        }
        // only scans the last token that was read
        for (size_t i = 0; i < strlen(buff); i++)
        {
            if (ispunct(buff[i])) punc++;
        }
        cout << "words =" << word - 1 << endl;
        cout << "Letters =" << countl - (4 + punc) << endl;
        cout << "Punctuation =" << punc << endl;
    }
};

int main()
{
    MyThread *counter[1];
    for (int i = 0; i < 1; i++) {
        counter[i] = new MyThread(i);
        counter[i]->start();
    }
    for (int i = 0; i < 1; i++) {
        counter[i]->wait();
    }
    return 0;
}
I can only get output using one thread. I have no idea how to chop the file into parts and have 4 threads each read their own part. Please point me in the right direction.
Upvotes: 1
Views: 4848
Reputation: 3022
The following is a stream-of-consciousness coding of how I would approach this and is not guaranteed to be correct, work or even compile. Note the question requires an even split of work and that the way to count words is to count spaces, so the file can be split into blocks of bytes without having to read the whole file before processing. Edit: got it to compile, seems to work.
#include <algorithm>
#include <cctype>
#include <future>
#include <iostream>
#include <string>
#include <vector>
#include <boost/filesystem.hpp>
#include <boost/iostreams/device/mapped_file.hpp>

using namespace boost::filesystem;

struct Count
{
    size_t words;
    size_t letters;
    size_t punctuation;
    Count() : words(0), letters(0), punctuation(0) {}
};

// classify every character in [start, end)
Count countData(const char *start, const char *end)
{
    Count count;
    for (auto data = start; data < end; data++)
    {
        if (ispunct(*data))      { count.punctuation++; }
        else if (isspace(*data)) { count.words++; }
        else if (isalpha(*data)) { count.letters++; }
    }
    return count;
}

int main(int argc, char *argv[])
{
    if (argc < 3)
    {
        return 1;
    }
    const char *filename = argv[2];
    const size_t numberThreads = std::max(std::stoi(argv[1]), 1);

    boost::iostreams::mapped_file_source file;
    std::vector<std::future<Count>> results;

    file.open(filename);
    if (file.is_open())
    {
        const size_t fileSize = file_size(filename);
        const size_t blockSize = fileSize / numberThreads;
        const char *dataStart = file.data();
        for (size_t i = 0; i < numberThreads; i++)
        {
            // each task gets one block; the last task also takes any remainder
            const char *start = dataStart + i * blockSize;
            const char *end = dataStart + blockSize + i * blockSize;
            if (i == numberThreads - 1) { end = dataStart + fileSize; }
            auto result = std::async(std::launch::async, [start, end]() {
                return countData(start, end);
            });
            results.emplace_back(std::move(result));
        }
    }
    else
    {
        return 1;
    }

    // merge the per-block counts
    size_t words = 0;
    size_t letters = 0;
    size_t punctuation = 0;
    for (auto &futureResult : results)
    {
        auto result = futureResult.get();
        words += result.words;
        letters += result.letters;
        punctuation += result.punctuation;
    }
    std::cout << "words : " << words << " letters : " << letters
              << " punctuation : " << punctuation << std::endl;
    return 0;
}
Upvotes: 1
Reputation: 606
Should or must the document be evenly split between the threads? If it only should be, you have an easier job: you get a performance gain and can handle theoretically arbitrarily large files without having to access the file system in parallel, which would be a bad idea as it hits performance hard.
If you only should spread the file evenly among the threads, you could read chunks of the file in the main thread and provide an atomic offset or pointer into the chunk. Each worker thread then analyzes a smaller piece at a time and updates its own statistics. The statistics are merged once all threads have joined.
Doing it this way lets you read the file sequentially, which gives you a performance benefit on mechanical drives, and lets the work start as soon as data is available without having to worry about how your threads are scheduled.
If spreading the work among the threads exactly evenly (rather than as efficiently as possible) is a hard requirement, you can still check the file size, divide it by the number of threads, and terminate each thread once it has processed its share, even if there is more left to do; some other thread will then be responsible for that work.
This approach combines a sequential file-system read to provide the data with a map-reduce strategy to compute the result.
Another lock-free approach would be to read chunks in the main thread and feed them to each thread's work queue, either round-robin (for an even workload) or by checking which work queue is currently the shortest and putting the chunk there.
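To make the chunk-and-merge idea concrete, here is a minimal sketch using std::thread and a shared atomic offset. It is only an illustration of the approach, not code from the question: the slice size, the Stats struct, and loading the whole file into one buffer up front (instead of streaming chunk by chunk) are simplifications chosen for the example.
#include <algorithm>
#include <atomic>
#include <cctype>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <thread>
#include <vector>

struct Stats { size_t words = 0, letters = 0, punct = 0; };

int main(int argc, char *argv[])
{
    if (argc < 3) return 1;
    const int numThreads = std::max(std::stoi(argv[1]), 1);

    // main thread reads the file sequentially into one buffer
    std::ifstream in(argv[2], std::ios::binary);
    std::string data((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());

    std::atomic<size_t> next(0);            // shared offset into the buffer
    const size_t slice = 4096;              // how much each worker claims at a time
    std::vector<Stats> perThread(numThreads);
    std::vector<std::thread> workers;

    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t] {
            for (;;) {
                size_t begin = next.fetch_add(slice);            // claim the next slice
                if (begin >= data.size()) break;
                size_t end = std::min(begin + slice, data.size());
                for (size_t i = begin; i < end; ++i) {
                    unsigned char c = data[i];
                    if (std::ispunct(c)) perThread[t].punct++;
                    else if (std::isspace(c)) perThread[t].words++;   // words counted as spaces, as the question asks
                    else if (std::isalpha(c)) perThread[t].letters++;
                }
            }
        });
    }

    // merge the per-thread statistics after joining
    Stats total;
    for (int t = 0; t < numThreads; ++t) {
        workers[t].join();
        total.words += perThread[t].words;
        total.letters += perThread[t].letters;
        total.punct += perThread[t].punct;
    }
    std::cout << "Words : " << total.words << "\nLetters : " << total.letters
              << "\nPunctuation : " << total.punct << std::endl;
    return 0;
}
Each worker keeps its own Stats entry, so no locking is needed beyond the atomic slice counter.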
Upvotes: 0
Reputation: 48605
You could get the length of the file and divide that number by the number of threads.
Then, seek to each potential beginning position (using seekg()) and adjust it by reading forward to the next space (std::isspace()) to avoid chopping words in half.
Then pass each beginning and end position to a thread (the end position is the beginning position of the following partition).
Each thread then uses seekg() to move to its assigned position and tellg() to determine when it has reached its assigned end.
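A rough sketch of that scheme, assuming the same ./docAnal <threads> <file> command line as the question; the helper names (adjustToSpace, analyze, Counts) are made up for the example and error handling is left out:
#include <algorithm>
#include <cctype>
#include <fstream>
#include <functional>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

struct Counts { long words = 0, letters = 0, punct = 0; };

// Move a raw split point forward to the next space so no word is chopped in half.
std::streamoff adjustToSpace(std::ifstream &in, std::streamoff pos)
{
    in.seekg(pos);
    int c;
    while ((c = in.get()) != EOF && !std::isspace(c)) {}
    if (in.eof()) in.clear();            // split point fell inside the last word
    return in.tellg();
}

// Each thread scans its own [begin, end) partition of the file.
void analyze(const std::string &name, std::streamoff begin, std::streamoff end, Counts &out)
{
    std::ifstream in(name, std::ios::binary);
    in.seekg(begin);
    int c;
    while (in.tellg() < end && (c = in.get()) != EOF) {
        if (std::ispunct(c)) out.punct++;
        else if (std::isspace(c)) out.words++;       // counting spaces as words, per the question
        else if (std::isalpha(c)) out.letters++;
    }
}

int main(int argc, char *argv[])
{
    if (argc < 3) return 1;
    const int n = std::max(std::stoi(argv[1]), 1);
    const std::string name = argv[2];

    std::ifstream probe(name, std::ios::binary | std::ios::ate);
    const std::streamoff size = probe.tellg();       // file length

    // Partition boundaries, each interior one nudged forward to the next space.
    std::vector<std::streamoff> bounds(n + 1);
    bounds[0] = 0;
    bounds[n] = size;
    for (int i = 1; i < n; ++i)
        bounds[i] = adjustToSpace(probe, i * (size / n));

    std::vector<Counts> partial(n);
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i)
        threads.emplace_back(analyze, std::cref(name), bounds[i], bounds[i + 1], std::ref(partial[i]));

    Counts total;
    for (int i = 0; i < n; ++i) {
        threads[i].join();
        total.words += partial[i].words;
        total.letters += partial[i].letters;
        total.punct += partial[i].punct;
    }
    std::cout << "Words : " << total.words << "\nLetters : " << total.letters
              << "\nPunctuation : " << total.punct << std::endl;
    return 0;
}
Only the interior boundaries need adjusting; the first always starts at 0 and the last always ends at the file size.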
Upvotes: 2
Reputation: 4638
My strategy would be:
Find out how many words are in the file, then split that number by the number of threads (4 here).
Have each thread read 1/4 of the file. For example, if there are 80 words in the file, thread 1 handles words 1-20, thread 2 words 21-40, thread 3 words 41-60 and thread 4 words 61-80.
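A minimal sketch of that strategy; the names countWords and analyzeRange and the word-range bookkeeping are invented for the illustration. Note that each thread re-reads the file from the beginning to skip ahead to its word range, which is the price of splitting by word count rather than by byte offset.
#include <algorithm>
#include <cctype>
#include <fstream>
#include <functional>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

struct Counts { long letters = 0, punct = 0; };

// Sequential pre-pass: how many whitespace-separated words does the file hold?
long countWords(const std::string &name)
{
    std::ifstream in(name);
    std::string w;
    long n = 0;
    while (in >> w) ++n;
    return n;
}

// Each thread re-reads the file and only analyzes words [first, last).
void analyzeRange(const std::string &name, long first, long last, Counts &out)
{
    std::ifstream in(name);
    std::string w;
    long idx = 0;
    while (idx < last && in >> w) {
        if (idx >= first) {
            for (unsigned char c : w) {
                if (std::ispunct(c)) out.punct++;
                else if (std::isalpha(c)) out.letters++;
            }
        }
        ++idx;
    }
}

int main(int argc, char *argv[])
{
    if (argc < 3) return 1;
    const long threads = std::max(std::stoi(argv[1]), 1);
    const std::string name = argv[2];

    const long words = countWords(name);
    const long perThread = words / threads;

    std::vector<Counts> partial(threads);
    std::vector<std::thread> pool;
    for (long i = 0; i < threads; ++i) {
        long first = i * perThread;
        long last = (i == threads - 1) ? words : first + perThread;   // last thread takes the remainder
        pool.emplace_back(analyzeRange, std::cref(name), first, last, std::ref(partial[i]));
    }

    Counts total;
    for (long i = 0; i < threads; ++i) {
        pool[i].join();
        total.letters += partial[i].letters;
        total.punct += partial[i].punct;
    }
    std::cout << "Words : " << words << "\nLetters : " << total.letters
              << "\nPunctuation : " << total.punct << std::endl;
    return 0;
}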
Upvotes: 0