Reputation: 57
We have a background job that fetches records from a particular table in batches of 1000 records at a time. The job runs every 30 minutes. Each record has an email (key) and a reason (value). The issue is that we have to look these records up against a data warehouse (a filtering step that pulls the last 180 days of data from the warehouse), and a call to the data warehouse is very costly in terms of time (approximately 45 minutes). The existing logic is: check for a flat file on disk; if it does not exist, call the data warehouse, fetch the records (0.2 to 0.25 million of them) and write them to disk; on subsequent runs, do the lookup against the flat file only, by loading the entire file into memory and searching/filtering in memory. This caused OutOfMemory errors many times.
So we modified the logic: we split the data warehouse call into 4 equal intervals and stored each result in its own file with ObjectOutputStream. On subsequent runs we load the data into memory one interval at a time, i.e. 0-30 days, 31-60 days and so on (roughly as in the sketch below). But this is not helping either. Can experts please suggest the ideal way to tackle this issue? Someone on the senior team suggested using CouchDB for storing and querying the data, but at first glance I would prefer a good solution that works with the existing infrastructure; if there is none, we can think about other tools.
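For context, a minimal sketch of the per-interval caching described above, under assumptions about the file naming and helper shape (the real code uses helpers such as getInvalidLeadsFromDWHS and setInvalidDataSetFromTempFile):

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;
import java.util.Map;

// Persist one interval's worth of warehouse data to its own file.
private void writeIntervalToTempFile(Map<String, String> invalidDataSet, String interval) throws IOException {
    File tempFile = new File("invalid_leads_" + interval + ".ser"); // hypothetical naming
    try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(tempFile))) {
        out.writeObject(new HashMap<String, String>(invalidDataSet)); // HashMap is Serializable
    }
}

// Read one interval back; note the whole interval still ends up on the heap at once.
@SuppressWarnings("unchecked")
private void readIntervalFromTempFile(Map<String, String> invalidDataSet, String interval) throws IOException, ClassNotFoundException {
    File tempFile = new File("invalid_leads_" + interval + ".ser");
    try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(tempFile))) {
        invalidDataSet.putAll((Map<String, String>) in.readObject());
    }
}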
Filtering code as of now.
private void filterRecords(List<Record> filterRecords) {
    long start = System.currentTimeMillis();
    logger.error("In filter Records - start (ms) : " + start);
    Map<String, String> invalidDataSet = new HashMap<String, String>(5000);
    try {
        boolean isValidTempResultFile = isValidTempResultFile();
        // 1. Fetch invalid leads data from DWHS only if the existing file is more than 7 days old.
        String[] intervals = {"0", "45", "90", "135"};
        if (!isValidTempResultFile) {
            logger.error("#CCBLDM isValidTempResultFile false");
            for (String interval : intervals) {
                invalidDataSet.clear();
                // This call takes approx 45 minutes to fetch the data
                getInvalidLeadsFromDWHS(invalidDataSet, interval, start);
                filterDataSet(invalidDataSet, filterRecords);
            }
        } else {
            // Load data from the temporary file in intervals to avoid heap space issues
            logger.error("#CCBLDM isValidTempResultFile true");
            intervals = new String[]{"1", "2", "3", "4"};
            for (String interval : intervals) {
                // GC won't happen immediately here, causing the OOM issue.
                invalidDataSet.clear();
                // Keeps 45 days of data in memory at a time
                setInvalidDataSetFromTempFile(invalidDataSet, interval);
                // 2. Mark the current record as incorrect if it belongs to the invalid email list
                filterDataSet(invalidDataSet, filterRecords);
            }
        }
    } catch (Exception filterExc) {
        Scheduler.log.error("Exception occurred while Filtering Records", filterExc);
    } finally {
        invalidDataSet.clear();
        invalidDataSet = null;
    }
    long end = System.currentTimeMillis();
    logger.error("Total time taken to filter all records ::: [" + (end - start) + "] ms.");
}
Upvotes: 1
Views: 160
Reputation: 46432
I'd strongly suggest slightly changing your infrastructure. If I understand you correctly, you're looking something up in a file and something up in a map. Working with a file is a pain, and loading everything into memory causes OOME. You can do better with a memory-mapped file, which allows fast and simple access.
There's Chronicle-Map, which offers a Map interface to data stored off-heap. The data actually reside on disk and take up main memory only as needed. You need to make your keys and values Serializable (which AFAIK you did already) or use an alternative serialization (which may be faster).
It's not a database, it's just a ConcurrentMap, which makes working with it very simple. The whole installation is just adding a line like compile "net.openhft:chronicle-map:3.14.5" to build.gradle (or a few Maven lines).
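For illustration, a minimal sketch of how such a persisted map could be set up and queried; the file name, entry count, and average key/value samples below are assumptions you'd tune for your data:

import net.openhft.chronicle.map.ChronicleMap;
import java.io.File;
import java.io.IOException;

public class InvalidLeadsStore {
    public static void main(String[] args) throws IOException {
        // Off-heap, file-backed map: survives restarts and avoids loading everything onto the heap.
        try (ChronicleMap<String, String> invalidLeads = ChronicleMap
                .of(String.class, String.class)
                .name("invalid-leads")
                .averageKey("someone@example.com")  // assumed average key
                .averageValue("hard bounce")        // assumed average value
                .entries(250_000)                   // roughly the 0.2-0.25 million records mentioned
                .createPersistedTo(new File("invalid-leads.dat"))) {

            // Fill once from the warehouse result...
            invalidLeads.put("someone@example.com", "hard bounce");

            // ...and on later runs just look up each record's email without loading the whole file.
            String reason = invalidLeads.get("someone@example.com");
            System.out.println(reason);
        }
    }
}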
There are alternatives, but Chronicle-Map is what I've tried (just started with it, but so far everything works perfectly).
There's also Chronicle-Queue, in case you need batch processing, but I strongly doubt you'll need it as you're limited by your disk rather than main memory.
Upvotes: 2
Reputation: 2168
This is a typical use case for batch-processing code. You could introduce a new column, say 'isProcessed', defaulting to 'false'. Read, say, 100 rows at a time (where id >= 1 && id <= 100), process them, and mark the flag as true; then read the next 100, and so on. At the end of the job, reset all flags to false (a minimal JDBC sketch of this idea is at the end of this answer). But over time it can become difficult to maintain and develop features on such a custom framework. There are open-source alternatives.
Spring Batch is a very good framework for these use cases and could be considered: https://docs.spring.io/spring-batch/trunk/reference/html/spring-batch-intro.html.
There is also Easy Batch: https://github.com/j-easy/easy-batch, which doesn't require Spring framework knowledge.
But if it is a huge data set that will keep growing in the future, consider moving to a 'big data' tech stack.
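As promised above, a minimal sketch of the hand-rolled approach, assuming a plain JDBC connection, a database that supports LIMIT, and hypothetical table/column names (records, id, email, reason, is_processed):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class BatchReader {
    private static final int BATCH_SIZE = 100;

    // Reads unprocessed rows in fixed-size batches and marks each batch as processed.
    public void processAll(Connection conn) throws SQLException {
        String select = "SELECT id, email, reason FROM records WHERE is_processed = false ORDER BY id LIMIT ?";
        String update = "UPDATE records SET is_processed = true WHERE id = ?";
        while (true) {
            boolean foundRows = false;
            try (PreparedStatement ps = conn.prepareStatement(select)) {
                ps.setInt(1, BATCH_SIZE);
                try (ResultSet rs = ps.executeQuery();
                     PreparedStatement mark = conn.prepareStatement(update)) {
                    while (rs.next()) {
                        foundRows = true;
                        handle(rs.getString("email"), rs.getString("reason")); // your filtering/lookup logic
                        mark.setLong(1, rs.getLong("id"));
                        mark.addBatch();
                    }
                    mark.executeBatch();
                }
            }
            if (!foundRows) {
                break; // nothing left to process
            }
        }
        // Reset the flags at the end of the job, as described above.
        try (PreparedStatement reset = conn.prepareStatement("UPDATE records SET is_processed = false")) {
            reset.executeUpdate();
        }
    }

    private void handle(String email, String reason) {
        // placeholder for the per-record work
    }
}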
Upvotes: 2