Reputation: 581
We have a requirement where we read data from three different files and join them on different columns within the same job.
Each file is around 25-30 GB, but our system has only 16 GB of RAM. We do the joins with tMap, and Talend keeps all the reference data in physical memory. In my case, I cannot provide that much memory, so the job fails with an out-of-memory error. If I use the "join with temp disk" option in tMap, the job is dead slow.
Please help me with these questions.
Thanks
Upvotes: 1
Views: 2189
Reputation:
You can try out some changes in the job definition itself, such as:
-- Use streaming.
-- Use trimming for big string data, so unnecessary data is not transferred.
-- Use OnSubjobOk as the connector instead of OnComponentOk, so the garbage collector has a chance to free more memory in time.
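The streaming and trimming advice above can be illustrated outside Talend. This is a minimal sketch (plain Python, not Talend code; the file name and delimiter are assumptions) of processing a large delimited file one row at a time instead of loading it fully into memory, trimming each field so oversized string data is not carried along:

```python
# Illustration only (not Talend): stream a large delimited file row by
# row and trim each field, so memory use stays constant regardless of
# file size. The delimiter and file name are hypothetical.
import csv

def stream_rows(path, delimiter=';'):
    """Yield one trimmed row at a time; the whole file is never in RAM."""
    with open(path, newline='') as f:
        for row in csv.reader(f, delimiter=delimiter):
            # Trimming: strip whitespace so unnecessary bytes are dropped
            yield [field.strip() for field in row]

# Usage sketch:
# for row in stream_rows('big_file.csv'):
#     process(row)
```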
Upvotes: 1
Reputation: 26
Talend can process large amounts of data quickly and efficiently; it all depends on your knowledge of the Talend platform.
Please consider the comments below as answers to your questions.
Q1. How does Talend process data larger than the RAM size?
A. You cannot use your entire RAM for Talend Studio. Only a fraction of the RAM can be used, usually about half of it.
For example, with 8 GB of memory available on a 64-bit system, the optimal settings can be: -vmargs
-Xms1024m
-Xmx4096m
-XX:MaxPermSize=512m
-Dfile.encoding=UTF-8
Now in your case, you would either have to increase your RAM substantially (the three files together are close to 100 GB)
OR simply write the data to the hard disk. For this you have to choose a temp data directory for buffer components such as tMap, tBufferInput, tAggregateRow, etc.
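The disk-based approach is essentially an external join: sort both inputs on the join key on disk, then merge them with constant memory. A minimal sketch of that merge step (plain Python for illustration; it assumes both inputs are already sorted by key and keys are unique):

```python
# Sketch of a sort-merge join over two inputs already sorted by join key;
# only one row per side is held in memory at a time, which is how a
# disk-backed join avoids keeping 25-30 GB of reference data in RAM.
def merge_join(left, right):
    """Yield (key, left_row, right_row) for matching keys. Inputs are
    iterables of (key, row) pairs sorted ascending by key (inner join,
    unique keys assumed for brevity)."""
    left, right = iter(left), iter(right)
    l = next(left, None)
    r = next(right, None)
    while l is not None and r is not None:
        if l[0] < r[0]:
            l = next(left, None)      # advance the smaller side
        elif l[0] > r[0]:
            r = next(right, None)
        else:
            yield (l[0], l[1], r[1])  # keys match: emit joined row
            l = next(left, None)
            r = next(right, None)
```

In practice the sorted runs would live in temp files on disk (as tMap's temp directory option does), with only the current row of each run in memory.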
Q2. Is pipeline parallelism in place with Talend? Am I missing anything in the job to accomplish that?
A. In Talend Studio, parallelization of data flows means partitioning the input data flow of a subjob into parallel processes and executing them simultaneously, so as to gain better performance.
However, this feature is available only if you have subscribed to one of the Talend Platform solutions.
When you have to develop a Job that processes a very large amount of data in Talend Studio, you can enable or disable parallelization with a single click, and the Studio automates the implementation across the Job.
Parallel Execution: the implementation of parallelization requires four key steps, as follows:
Partitioning: In this step, the Studio splits the input records into a given number of threads.
Collecting: In this step, the Studio collects the split threads and sends them to a given component for processing.
Departitioning: In this step, the Studio groups the outputs of the parallel executions of the split threads.
Recollecting: In this step, the Studio captures the grouped execution results and outputs them to a given component.
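The four steps above can be mimicked outside Talend. A rough sketch of partition → collect → departition/recollect using a thread pool (plain Python for illustration; the per-record work function and thread count are made up):

```python
# Sketch of partition / collect / departition / recollect with a thread
# pool. The transformation applied per record is a made-up placeholder.
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Placeholder work done by each parallel worker
    return [value * 2 for value in chunk]

def parallel_process(records, num_threads=4):
    # Partitioning: split the input records into num_threads slices
    chunks = [records[i::num_threads] for i in range(num_threads)]
    # Collecting: hand each slice to a worker thread for processing
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = pool.map(process_chunk, chunks)
    # Departitioning / recollecting: group the parallel outputs back together
    out = []
    for partial in results:
        out.extend(partial)
    return sorted(out)  # sorted only to make the demo output deterministic
```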
Q3. tUniqRow and join operations are done in physical memory, causing the job to run dead slow. A disk option is available for this functionality, but it is too slow.
Q4. How can performance be improved without pushing the data to the DB (ELT)? Can Talend handle huge data in the millions of rows with a smaller amount of RAM?
A 3 & 4. Here I suggest you insert the data directly into the database using the tOutputBulkExec components, and then apply these operations at the DB level using the ELT components.
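The ELT pattern suggested above (bulk load first, then join inside the database) can be sketched with SQLite standing in for the real target DB; the table and column names here are invented for illustration:

```python
# Sketch of the ELT pattern: bulk-insert both datasets into the database,
# then let the DB engine perform the join instead of tMap holding the
# reference data in RAM. SQLite stands in for the real target DB;
# table and column names are invented.
import sqlite3

def elt_join(orders, customers):
    con = sqlite3.connect(':memory:')
    con.execute('CREATE TABLE orders (id INTEGER, cust_id INTEGER)')
    con.execute('CREATE TABLE customers (id INTEGER, name TEXT)')
    # Bulk load (the tOutputBulkExec step, in Talend terms)
    con.executemany('INSERT INTO orders VALUES (?, ?)', orders)
    con.executemany('INSERT INTO customers VALUES (?, ?)', customers)
    # Join executed inside the DB (what the ELT components generate)
    rows = con.execute(
        'SELECT o.id, c.name FROM orders o '
        'JOIN customers c ON o.cust_id = c.id ORDER BY o.id').fetchall()
    con.close()
    return rows
```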
Upvotes: 1