Reputation: 21
Here is the problem.
I have 20 very large files, each approximately 10 GB, and I need to split each bulk file by A) criteria within the record and B) the type of bulk file it is.
Example.
Each bulk file represents an occupation: we have Lawyers, Doctors, Teachers and Programmers. Each of these bulk files contains millions of records, but they belong to a small set of individuals, say 40 different people in total.
A record in the doctor file may look like
XJOHN 1234567 LOREMIPSUMBLABLABLA789
I would need this record from the file to be output into a file called JOHN.DOCTOR.7
JOHN is the person's name, 7 is the last digit in the numeric sequence, and DOCTOR is the file type. I need to do this because of file size limitations.

Currently, I'm using Perl to read the bulk files line by line and print each record into the appropriate output file. I'm opening a new filehandle for each record to avoid having multiple threads writing to the same filehandle and corrupting the data. The program is threaded, one thread per bulk file.

I cannot install any third-party software; assume I only have whatever comes standard with Red Hat Linux. I'm looking for either a Linux command that does this more efficiently, or a better way to do it in Perl.
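For reference, a simplified sketch of what each thread currently does per line (the field positions and the name/number extraction are illustrative, not my exact record layout):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative only: assumes the name is the first whitespace-delimited field
# (minus the leading type character) and the numeric sequence is the second.
my $type = 'DOCTOR';                     # derived from which bulk file this is
open my $in, '<', 'doctor.bulk' or die "open: $!";

while (my $line = <$in>) {
    my ($name_field, $number) = split ' ', $line;
    my $name  = substr $name_field, 1;   # drop the leading 'X'
    my $digit = substr $number, -1;      # last digit of the numeric sequence

    my $outfile = "$name.$type.$digit";  # e.g. JOHN.DOCTOR.7

    # A new filehandle is opened for every single record:
    open my $out, '>>', $outfile or die "open $outfile: $!";
    print {$out} $line;
    close $out;
}
close $in;
```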
Thanks!
Upvotes: 0
Views: 671
Reputation: 3484
An alternative approach is to use processes instead of threads, via Parallel::ForkManager.
Additionally, I would consider using a map/reduce approach by giving each process/thread its own work directory, in which it would write intermediate files, one per doctor, lawyer, etc.
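A rough sketch of that map phase, assuming the same record layout as in the question (the field positions and file names are illustrative): each child process gets its own work directory and keeps its output filehandles open in a hash instead of reopening one per record.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;
use File::Path qw(make_path);
use File::Basename qw(basename);

# Hypothetical mapping from bulk file to its occupation type.
my %type_of = (
    'doctor.bulk' => 'DOCTOR',
    'lawyer.bulk' => 'LAWYER',
    # ... one entry per bulk file
);

my $pm = Parallel::ForkManager->new(20);   # up to one child per bulk file

for my $bulk (keys %type_of) {
    $pm->start and next;                   # parent moves on; child continues here

    my $type    = $type_of{$bulk};
    my $workdir = 'work.' . basename($bulk);
    make_path($workdir);

    my %fh;                                # cache: one open handle per output key
    open my $in, '<', $bulk or die "open $bulk: $!";
    while (my $line = <$in>) {
        my ($name_field, $number) = split ' ', $line;
        my $name  = substr $name_field, 1;
        my $digit = substr $number, -1;
        my $key   = "$name.$type.$digit";

        unless ($fh{$key}) {
            open $fh{$key}, '>>', "$workdir/$key"
                or die "open $workdir/$key: $!";
        }
        print { $fh{$key} } $line;
    }
    close $_ for values %fh;
    close $in;

    $pm->finish;                           # end of this child
}
$pm->wait_all_children;
```

With roughly 40 people, a handful of occupations and 10 possible trailing digits, each child holds at most a few hundred open handles, which is well within default limits, and because each child owns its work directory exclusively there is no contention between writers.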
I would then write a second program, the reducer, which could be a very short shell script, to concatenate the intermediate files into their respective final output files.
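For example, the reduce step might be little more than this (the work.* directory naming follows the sketch above and is only an assumption):

```sh
#!/bin/sh
# Concatenate each per-process intermediate file into its final output file.
# Assumes the children wrote files named NAME.TYPE.DIGIT under work.* directories.
for f in $(find work.* -type f -printf '%f\n' | sort -u); do
    cat work.*/"$f" > "$f"
done
```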
Upvotes: 1