Perl - Multithreading to search for a list of terms in a very large text file
I've searched for an answer to my question, but I haven't found a solution that fits my needs.
I have a large text file (4 GB), which is an access.log file from a proxy.
I have another file of 7000 lines, each containing a domain address or part of a URL to search for in the log file.
The problem is that searching the log file for all 7000 terms takes a very long time.
I would like to reduce this time using multithreading or something else.
But I've never programmed anything like this before :-/
Could you help me get started?
Thanks in advance!
Answers (1)
Conceptually (not specific to Perl), I'd go with something like this:
1. Create N threads and assign 7000/N regexes to each.
   - Preferably, N = the number of hardware threads available on the machine.
   - It might be worth assigning more or fewer regexes to some threads, depending on regex complexity or length. The goal is for all threads to get roughly the same amount of work. This might require a heavy pre-processing pass over the regexes.
2. Load a chunk of data into memory.
   - You can experiment with different chunk sizes here.
   - The goal is for loading a chunk to take roughly as long as the threads take to process one.
3. Start your regex threads on the chunk you just loaded. In parallel, use another thread to load the next chunk of data into memory.
4. Wait until all threads are finished.
5. Discard the chunk that was just processed.
6. Go to (3).
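The loop above can be sketched roughly as follows. This is only an illustration in Python (my Perl is too rusty for a tested version; the same structure could probably be built on Perl's `threads` module). The function names, the chunk size, and the use of term length as a stand-in for regex cost are all assumptions of mine. Chunks are extended to the next line break so that no log line is split between two chunks. Note that in CPython the GIL limits how much regex matching actually runs in parallel in threads, so a process pool may be needed for a real speedup.

```python
import io
import re
from concurrent.futures import ThreadPoolExecutor


def split_terms(terms, n):
    """Greedy balance: hand each term (longest first) to the lightest group.
    Term length is only an assumed proxy for how costly a regex is."""
    groups = [[] for _ in range(n)]
    loads = [0] * n
    for term in sorted(terms, key=len, reverse=True):
        i = loads.index(min(loads))
        groups[i].append(term)
        loads[i] += len(term)
    return groups


def read_chunk(stream, size):
    """Read about `size` characters, extended to the end of the current line."""
    data = stream.read(size)
    if data and not data.endswith("\n"):
        data += stream.readline()
    return data


def scan(group, chunk):
    """Return (term, line) pairs for every term of this group found in a line."""
    return [(term, line) for line in chunk.splitlines()
            for term, pattern in group if pattern.search(line)]


def search_log(stream, terms, n_workers=4, chunk_size=1 << 20):
    # Compile each term once; re.escape treats the terms as literal text.
    groups = [[(t, re.compile(re.escape(t))) for t in g]
              for g in split_terms(terms, n_workers) if g]
    hits = []
    # One extra worker so the loader can run alongside the scanners.
    with ThreadPoolExecutor(max_workers=len(groups) + 1) as pool:
        chunk = read_chunk(stream, chunk_size)      # load the first chunk
        while chunk:
            # Scan the current chunk while the next one is being loaded.
            loader = pool.submit(read_chunk, stream, chunk_size)
            scans = [pool.submit(scan, g, chunk) for g in groups]
            for future in scans:                    # wait for all scanners
                hits.extend(future.result())
            chunk = loader.result()                 # old chunk is dropped here
    return hits


# Usage with a small in-memory stand-in for the log file.
demo_log = io.StringIO(
    "GET http://example.com/a\nGET http://foo.org/b\n"
    "GET http://bar.net/c\nGET http://example.com/d\n"
)
print(search_log(demo_log, ["example.com", "bar.net"], n_workers=2, chunk_size=30))
```

For the real file you would pass an opened file handle instead of the `StringIO` object and experiment with `chunk_size` and `n_workers`.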
Advantages:
- Cache-friendly - all the threads scan the same data at the same time.
- Streaming - the size of the data you need to hold in memory is at most 2*(chunk size) at a time, making it very cheap on memory and completely agnostic to the overall data size.
- Scalable - more available threads translate directly into speed (as long as you adjust the chunk size appropriately).
  - There's a limit here, of course. At a certain point the chunk size will be so large that it might slow down the regex threads because of poor memory locality. Adding more threads beyond that point will probably only slow things down.
Also, try to have each thread keep its own list of matches rather than writing them all into one shared location, since unsynchronized writes to shared data can create a race condition. If you need to synchronize the threads, do it between steps (4) and (5) above.
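A minimal sketch of that idea, again in Python and only as an illustration: each thread gets a private list to append to, and the lists are merged by a single thread after every worker has been joined, so no lock is needed. The names `worker` and `scan_chunk` are hypothetical, and plain substring tests stand in for the regexes to keep it short.

```python
import threading


def worker(terms, lines, out):
    """Append matches to `out`, a list that only this thread writes to."""
    for line in lines:
        for term in terms:
            if term in line:          # substring test as a stand-in for a regex
                out.append((term, line))


def scan_chunk(term_groups, lines):
    results = [[] for _ in term_groups]   # one private list per thread
    threads = [threading.Thread(target=worker, args=(group, lines, out))
               for group, out in zip(term_groups, results)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                          # wait until all threads are finished
    # Merge in one thread, after the join and before the chunk is discarded.
    merged = []
    for out in results:
        merged.extend(out)
    return merged


# Usage: two groups of terms scanned against the same lines.
print(scan_chunk([["a.com"], ["c.net", "b.org"]], ["a.com/x", "b.org/y", "c.net/z"]))
```

The threads share only the read-only `lines`, so the merge after the join is the single synchronization point.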
Unfortunately my Perl is very rusty, but until you get a better answer I'm going to post this in the hopes that it will be useful.