Reputation: 4537
Problem: I have a huge raw text file (assume 3 GB). I need to go through each word in the file and find out how many times each word appears in the file.
My Proposed Solution: Split the huge file into multiple files, with each split file holding words in a sorted manner. For example, all the words starting with "a" will be stored in an "_a.dic" file. So, at any time we will not exceed 26 files.
The problem with this approach is:
I can use streams to read the file, but I want to use threads to read certain parts of it. For example, read bytes 0-1024 with a separate thread (at least 4-8 threads, based on the number of processors in the box). Is this possible, or am I dreaming?
Any better approach?
Note: It should be a pure C++ or C based solution. No databases etc. are allowed.
Upvotes: 5
Views: 4646
Reputation: 3451
Not C, and a bit UGLY, but it took only 2 minutes to bang out:
perl -lane '$h{$_}++ for @F; END{for $w (sort {$h{$b}<=>$h{$a} || $a cmp $b} keys %h) {print "$h{$w}\t$w"}}' file > freq
- Loop over each line with -n
- Split each line into @F words with -a
- Each $_ word increments hash %h
- Once the END of file has been reached, sort the hash by the frequency $h{$b}<=>$h{$a}
- If two frequencies are identical, sort alphabetically ($a cmp $b)
- Print the frequency $h{$w} and the word $w
- Redirect the results to file 'freq'
I ran this code on a 3.3GB text file with 580,000,000 words.
Perl 5.22 completed in 173 seconds.
My input file already had punctuation stripped out, and uppercase converted to lowercase, using this bit of code:
perl -pe "s/[^a-zA-Z \t\n']/ /g; tr/A-Z/a-z/" file_raw > file
(runtime of 144 seconds)
The word-counting script could alternatively be written in awk:
awk '{for (i=1; i<=NF; i++){h[$i]++}} END{for (w in h){printf("%s\t%s\n", h[w], w)}}' file | sort -rn > freq
Upvotes: 0
Reputation: 28127
I don't think using multiple threads that read parts of the file in parallel is going to help much. I would expect that this application is bound to the bandwidth and latency of your hard disk, not the actual word counting. Such a multi-threaded version might actually perform worse, because "quasi-random" file access is typically slower than "linear" file access.
If the CPU really is busy in a single-threaded version, there might be a potential speedup. One thread could read the data in big chunks and put them into a queue of limited capacity. A number of worker threads could each operate on their own chunk and count the words. After the counting worker threads have finished, you have to merge the word counters.
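A rough C++11 sketch of that layout, assuming whitespace-separated words; for brevity it ignores words that straddle a chunk boundary (a real version would carry the partial word over to the next chunk). The file name, chunk size, and queue capacity are illustrative:

#include <condition_variable>
#include <cstddef>
#include <fstream>
#include <map>
#include <mutex>
#include <queue>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

std::queue<std::string> q;               // chunks waiting to be counted
std::mutex m;
std::condition_variable not_empty, not_full;
bool done = false;
const std::size_t kMaxQueued = 8;        // the "limited capacity"

void reader(const char* path) {
    std::ifstream in(path, std::ios::binary);
    std::string chunk(1 << 20, '\0');    // read in 1 MB chunks
    while (in.read(&chunk[0], chunk.size()) || in.gcount() > 0) {
        std::unique_lock<std::mutex> lk(m);
        not_full.wait(lk, [] { return q.size() < kMaxQueued; });
        q.push(chunk.substr(0, (std::size_t)in.gcount()));
        not_empty.notify_one();
    }
    std::lock_guard<std::mutex> lk(m);
    done = true;
    not_empty.notify_all();
}

void worker(std::map<std::string, std::size_t>& counts) {
    for (;;) {
        std::string chunk;
        {
            std::unique_lock<std::mutex> lk(m);
            not_empty.wait(lk, [] { return !q.empty() || done; });
            if (q.empty()) return;       // queue drained and reader finished
            chunk = std::move(q.front());
            q.pop();
            not_full.notify_one();
        }
        std::istringstream words(chunk); // count the words in this chunk
        std::string w;
        while (words >> w)
            ++counts[w];
    }
}

int main() {
    unsigned n = std::thread::hardware_concurrency();
    std::vector<std::map<std::string, std::size_t>> partial(n ? n : 4);
    std::vector<std::thread> workers;
    for (auto& p : partial)
        workers.emplace_back(worker, std::ref(p));
    std::thread r(reader, "big.txt");
    r.join();
    for (auto& t : workers)
        t.join();
    std::map<std::string, std::size_t> total;  // merge the word counters
    for (auto& p : partial)
        for (auto& kv : p)
            total[kv.first] += kv.second;
}

Each worker counts into its own map, so the mutex only guards the chunk queue; the per-worker maps are merged once, single-threaded, at the end.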
Upvotes: 6
Reputation: 180303
As others have indicated, the bottleneck will be the disk I/O. I therefore suggest that you use overlapped I/O. This basically inverts the program logic. Instead of your code trying to determine when to do I/O, you simply tell the operating system to call your code whenever it has finished a bit of I/O. If you use I/O completion ports, you can even tell the OS to use multiple threads for processing the file chunks.
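A minimal sketch of one overlapped read loop on Windows; a fuller version would keep several reads in flight or hand completions to a thread pool via an I/O completion port. process_chunk() is a hypothetical stand-in for the word counting:

#include <windows.h>

void process_chunk(const char*, DWORD) { /* count words in the chunk here */ }

int main() {
    HANDLE file = CreateFileA("big.txt", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING, FILE_FLAG_OVERLAPPED,
                              nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    static char buffer[1 << 20];                 // 1 MB per read
    OVERLAPPED ov = {};                          // file offset lives here
    ov.hEvent = CreateEventA(nullptr, TRUE, FALSE, nullptr);
    ULONGLONG offset = 0;

    for (;;) {
        ov.Offset = (DWORD)(offset & 0xFFFFFFFFu);
        ov.OffsetHigh = (DWORD)(offset >> 32);
        if (!ReadFile(file, buffer, sizeof buffer, nullptr, &ov) &&
            GetLastError() != ERROR_IO_PENDING)
            break;                               // EOF or hard error
        DWORD read = 0;
        // This sketch waits immediately; real overlapped code would do other
        // work (e.g. count the previous chunk) before this wait.
        if (!GetOverlappedResult(file, &ov, &read, TRUE) || read == 0)
            break;
        process_chunk(buffer, read);
        offset += read;
    }
    CloseHandle(ov.hEvent);
    CloseHandle(file);
}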
Upvotes: 1
Reputation: 10828
First, decide on the data structure for saving the words.
The obvious choice is a map. But perhaps a trie would serve you better. In each node, you save the count for the word; a count of 0 means that it is only part of a word. You can insert into the trie using a stream, reading your file character by character, as sketched below.
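A minimal sketch of such a trie, assuming lowercase ASCII words (the 26-way fan-out and the use of make_unique, which needs C++14, are my choices, not prescribed here):

#include <array>
#include <cstddef>
#include <memory>
#include <string>

struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> child;  // one slot per letter
    std::size_t count = 0;  // 0 = this node is only a prefix, not a full word
};

// Walk (and extend) the trie one character at a time, then bump the count
// at the node where the word ends.
void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        int i = c - 'a';
        if (!node->child[i])
            node->child[i] = std::make_unique<TrieNode>();
        node = node->child[i].get();
    }
    ++node->count;
}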
Second, multithreading: yes or no? This one is not easy to answer. Depending on how large the data structure grows and how you parallelize, the answer may differ.
One thing you have to think about: each thread has to find a word boundary to start at, but that should not pose a great problem (e.g. each thread walks from its start offset to the first word boundary and begins there; at the end, each thread finishes the word it is working on).
Upvotes: 3
Reputation: 490768
While you can use a second thread to analyze the data after reading it, you're probably not going to gain a huge amount by doing so. Trying to use more than one thread to read the data will almost certainly hurt speed rather than improving it. Using multiple threads to process the data is pointless -- processing will be many times faster than reading, so even with only one extra thread, the limit is going to be the disk speed.
One (possible) way to gain significant speed is to bypass the usual iostreams -- while some are nearly as fast as using C FILE*'s, I don't know of anything that's really faster, and some are substantially slower. If you're running this on a system (e.g. Windows) that has an I/O model that's noticeably different from C's, you can gain considerably more with a little care.
The problem is fairly simple: the file you're reading is (potentially) larger than the cache space you have available -- but you won't gain anything from caching, because you're not going to reread chunks of the file again (at least if you do things sensibly). As such, you want to tell the system to bypass any caching, and just transfer data as directly as possible from the disk drive to your memory where you can process it. In a Unix-like system, that's probably open() and read() (and won't gain you a whole lot). On Windows, that's CreateFile and ReadFile, passing the FILE_FLAG_NO_BUFFERING flag to CreateFile -- and it'll probably roughly double your speed if you do it right.
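A hedged sketch of that FILE_FLAG_NO_BUFFERING variant. With this flag, the buffer address, transfer size, and file offset must all be multiples of the volume sector size (4096 is assumed here); _aligned_malloc is the MSVC CRT way to get an aligned buffer:

#include <windows.h>
#include <malloc.h>

int main() {
    HANDLE file = CreateFileA("big.txt", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING, FILE_FLAG_NO_BUFFERING,
                              nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    const DWORD chunk = 1 << 20;                        // sector-size multiple
    char* buffer = (char*)_aligned_malloc(chunk, 4096); // sector-aligned
    DWORD read = 0;
    while (ReadFile(file, buffer, chunk, &read, nullptr) && read > 0) {
        // count the words in buffer[0..read) here
    }
    _aligned_free(buffer);
    CloseHandle(file);
}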
You've also gotten some answers advocating doing the processing using various parallel constructs. I think these are fundamentally mistaken. Unless you do something horribly stupid, the time to count the words in the file will be only a few milliseconds longer than it takes to simply read the file.
The structure I'd use would be to have two buffers of, say, a megabyte apiece. Read data into one buffer. Turn that buffer over to your counting thread to count the words in that buffer. While that's happening, read data into the second buffer. When those are done, basically swap buffers and continue. There is a little bit of extra processing you'll need to do in swapping buffers to deal with a word that may cross the boundary from one buffer to the next, but it's pretty trivial (basically, if the buffer doesn't end with white space, you're still in a word when you start operating on the next buffer of data).
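A sketch of that two-buffer scheme, using std::async as the counting thread; the carry string holds a word that straddles a buffer boundary, which is exactly the boundary case described above. The file name and buffer size are illustrative:

#include <cctype>
#include <cstddef>
#include <cstdio>
#include <future>
#include <string>
#include <unordered_map>
#include <vector>

using Counts = std::unordered_map<std::string, std::size_t>;

// Count whole words in data[0..len); 'carry' holds a word that started in
// the previous buffer and may continue into this one.
void count_words(const char* data, std::size_t len, std::string& carry, Counts& counts) {
    std::string word = carry;
    for (std::size_t i = 0; i < len; ++i) {
        if (std::isspace((unsigned char)data[i])) {
            if (!word.empty()) { ++counts[word]; word.clear(); }
        } else {
            word += data[i];
        }
    }
    carry = word;  // possibly the front half of a word; finished on the next call
}

int main() {
    FILE* f = std::fopen("big.txt", "rb");
    if (!f) return 1;
    const std::size_t kBuf = 1 << 20;   // 1 MB per buffer
    std::vector<char> buf[2] = {std::vector<char>(kBuf), std::vector<char>(kBuf)};
    Counts counts;
    std::string carry;
    std::future<void> pending;
    int cur = 0;
    for (;;) {
        std::size_t n = std::fread(buf[cur].data(), 1, kBuf, f);
        if (pending.valid()) pending.wait();  // previous buffer fully counted
        if (n == 0) break;
        const char* p = buf[cur].data();
        pending = std::async(std::launch::async,
                             [p, n, &carry, &counts] { count_words(p, n, carry, counts); });
        cur ^= 1;                             // read the next chunk into the other buffer
    }
    if (!carry.empty()) ++counts[carry];      // file didn't end in whitespace
    for (const auto& kv : counts)
        std::printf("%zu\t%s\n", kv.second, kv.first.c_str());
    std::fclose(f);
}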
As long as you're sure it'll only be used on a multi-processor (multi-core) machine, using real threads is fine. If there's a chance this might ever be done on a single-core machine, you'd be somewhat better off using a single thread with overlapped I/O instead.
Upvotes: 1
Reputation: 1897
First, I'm pretty sure that C/C++ isn't the best way to handle this. Ideally, you'd use some map/reduce framework for the parallelism, too.
But, assuming your constraints, here's what I'd do.
1) Split the text file into smaller chunks. You don't have to do this by the first letter of the word. Just break them up into, say, 5000-word chunks. In pseudocode, you'd do something like this:
index = 0
numwords = 0
mysplitfile = openfile(index + "-split.txt")
while (bigfile >> word)
    mysplitfile << word
    numwords++
    if (numwords > 5000)
        mysplitfile.close()
        numwords = 0
        index++
        mysplitfile = openfile(index + "-split.txt")
2) Use a shared map data structure and pthreads to spawn new threads to read each of the subfiles. Again, pseudocode:
maplock = create_pthread_lock()
sharedmap = std::map<string, size_t>()
for every index-split.txt file:
    spawn-new-thread(myfunction, filename, sharedmap, maplock)
wait-for-all-threads()
dump_map(sharedmap)

void myfunction(filename, sharedmap, lock) {
    localmap = std::map<string, size_t>()
    file = openfile(filename)
    while (file >> word)
        localmap[word]++
    acquire(lock)
    for key, value in localmap
        sharedmap[key] += value
    release(lock)
}
Sorry for the syntax. I've been writing a lot of python lately.
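For reference, a rough C++11 equivalent of that counting step using std::thread instead of raw pthreads (the file names are illustrative); note the join before dumping, so the shared map isn't read while threads are still merging into it:

#include <cstddef>
#include <fstream>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

std::map<std::string, std::size_t> sharedmap;
std::mutex maplock;

void count_file(const std::string& filename) {
    std::map<std::string, std::size_t> localmap;
    std::ifstream file(filename);
    std::string word;
    while (file >> word)
        ++localmap[word];                  // operator[] default-initializes to 0

    std::lock_guard<std::mutex> guard(maplock);
    for (const auto& kv : localmap)
        sharedmap[kv.first] += kv.second;  // merge under the lock
}

int main() {
    std::vector<std::string> files = {"0-split.txt", "1-split.txt"};  // illustrative
    std::vector<std::thread> threads;
    for (const auto& f : files)
        threads.emplace_back(count_file, f);
    for (auto& t : threads)
        t.join();                          // wait before reading sharedmap
}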
Upvotes: 0
Reputation: 4555
What you are looking for is a regex. This Stack Overflow thread on C++ regex engines should help:
C++: what regex library should I use?
Upvotes: 0
Reputation: 14446
A stream has only one cursor. If you access the stream from more than one thread at a time, you cannot be sure to read where you want, because reads happen from the cursor position.
What I would do is have only one thread (maybe the main one) that reads the stream and dispatches the bytes it reads to other threads.
For example, the reading thread can push each block of bytes onto a queue that the analysis threads consume.
This way you can separate stream reading from stream analysis.
Upvotes: 0
Reputation: 755084
You need to look at 'The Practice of Programming' by Kernighan and Pike, and specifically chapter 3.
In C++, use a map based on the strings and a count (std::map<string,size_t>, IIRC). Read the file (once - it's too big to read more than once), splitting it into words as you go (for some definition of 'word'), and incrementing the count in the map entry for each word you find.
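A minimal version of that read-once loop (the file name is illustrative; operator>> supplies a crude whitespace definition of 'word'):

#include <cstddef>
#include <fstream>
#include <iostream>
#include <map>
#include <string>

int main() {
    std::ifstream in("big.txt");
    std::map<std::string, std::size_t> counts;
    std::string word;
    while (in >> word)        // operator>> splits on whitespace
        ++counts[word];       // operator[] default-initializes the count to 0
    for (const auto& kv : counts)
        std::cout << kv.second << '\t' << kv.first << '\n';
}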
In C, you'll have to create the map yourself. (Or find David Hanson's "C Interfaces and Implementations".)
Or you can use Perl, or Python, or Awk (all of which have associative arrays, equivalent to a map).
Upvotes: 15
Reputation: 4487
C based solution?
I think Perl was born for this exact purpose.
Upvotes: 0