Reputation: 1
I am currently working on a Spring Batch job that consumes data from a large SQL database with millions of rows.
The processor does some grouping of the rows retrieved by the reader via a large SQL query with joins.
And the writer writes the result to another table.
The problem is that this batch has performance issues: the SQL select queries take a lot of time, and the steps are not executed in multithreading.
So I'd like to run them multithreaded, but the problem is that the steps group the rows, for example by calculating a total amount over all the rows of the same type.
So if I go multithreaded, how can I do that grouping when each partition is processed in a different thread, knowing that with millions of rows I can't store them in the execution context to retrieve them after the step and do the grouping there? And I can't save them in the database either, since it's millions of rows. Do you have any idea how I can do this? I hope I was able to explain my problem well. Thanks in advance for your help.
Upvotes: 0
Views: 4892
Reputation: 97
We had a similar use case: I had to start by reading millions of records based on certain criteria, taken as input from a REST endpoint, and process them in parallel using 20-30 threads to meet extreme deadlines. The subsequent challenge was that the same complex queries hit the database and the results then had to be partitioned across the spawned threads.
A typical batch process has the objective: read, make some HTTP calls / manipulate the data, and write it to a response log table.
Spring Batch provides the capability to keep track of the records processed, so that a restart can be initiated to pick up the remaining lot. An alternative is a flag in your master table marking a record as processed, so it need not be picked up during a restart; a sketch of that follows.
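A minimal sketch of that flag-based alternative, assuming a hypothetical master_table with a processed column and an item type YourDataType exposing getYourKey() (none of these names come from the original): the main writer is composed with a second writer that flips the flag, so both run in the same chunk transaction.
import java.util.Arrays;
import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.support.CompositeItemWriter;
import org.springframework.context.annotation.Bean;

@Bean
public CompositeItemWriter<YourDataType> writerWithProcessedFlag(
        DataSource dataSource, JdbcBatchItemWriter<YourDataType> mainWriter) throws Exception {
    // second writer: flip the flag so a restart does not pick the record up again
    JdbcBatchItemWriter<YourDataType> flagWriter = new JdbcBatchItemWriter<>();
    flagWriter.setDataSource(dataSource);
    flagWriter.setSql("UPDATE master_table SET processed = 'Y' WHERE yourkey = ?");
    flagWriter.setItemPreparedStatementSetter((item, ps) -> ps.setString(1, item.getYourKey()));
    flagWriter.afterPropertiesSet();
    // both delegates run for every chunk, inside the same transaction
    CompositeItemWriter<YourDataType> composite = new CompositeItemWriter<>();
    composite.setDelegates(Arrays.asList(mainWriter, flagWriter));
    return composite;
}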
The main challenge we faced: let's say you have 10000 records and you need to process them across 5 parallel threads.
Multiple creative solutions can be implemented, but the two most often used, which fit most use cases, are partitioning by ranges and partitioning by modulo.
Considering the memory the machine can serve, select a suitable number of threads, e.g. 5: with 10000/5, each thread processes 2000 records.
Partitioning is the process of splitting the ranges and letting each step execution pick its range up in its own thread and run it. For the above step we need to split those ranges and pass them into the query execution, so each thread fetches the records for its range and continues processing separately:
Thread 0 : 1–2000
Thread 1 : 2001–4000
Thread 2 : 4001–6000
Thread 3 : 6001–8000
Thread 4 : 8001–10000
Another partitioning logic is to assign the threads 0 to 4 and query on the modulo of the key. One drawback is that one residue class could receive more load than the others, whereas the range approach ensures everyone gets a fair share.
The split data is passed to a separate thread, which processes it and writes data at the commit interval (chunk size) specified on the step.
Code:
@Bean
@StepScope
public JdbcPagingItemReader<YourDataType> dataReader(
        @Value("#{jobParameters['param1']}") final String param1,
        @Value("#{stepExecutionContext['modulo']}") Long modulo) throws Exception {
    logger.info("Thread started reading for modulo index : " + modulo);
    JdbcPagingItemReader<YourDataType> reader = new JdbcPagingItemReader<>();
    reader.setDataSource(getDataSource());
    reader.setRowMapper(new YourDataTypeRowMapper());
    reader.setQueryProvider(queryProvider(param1, modulo));
    return reader;
}

public OraclePagingQueryProvider queryProvider(String param1, Long modulo) throws Exception {
    OraclePagingQueryProvider provider = new OraclePagingQueryProvider();
    provider.setSelectClause("your elements to query");
    provider.setFromClause("your tables / joined tables");
    provider.setWhereClause("your where clauses AND MOD(TO_NUMBER(yourkey), 5) = " + modulo);
    Map<String, Order> sortKeys = new HashMap<>();
    sortKeys.put("yoursortkey", Order.ASCENDING);
    provider.setSortKeys(sortKeys);
    return provider;
}
Sample data reader: param1 is any parameter the user wants to input; modulo is a step execution parameter, passed from the Partitioner object.
A Partitioner set up for modulo 5 would create contexts with modulo 0|1|2|3|4, spawning 5 threads, each of which interacts with the reader and fetches the data for its divided set.
@Bean
public JdbcBatchItemWriter<RespData> dataWriter() throws Exception {
    logger.info("Initializing data writer");
    JdbcBatchItemWriter<RespData> databaseItemWriter = new JdbcBatchItemWriter<>();
    databaseItemWriter.setDataSource(getDataSource()); // inject your data source here
    databaseItemWriter.setSql(INSERT_QUERY_HERE);
    ItemPreparedStatementSetter<RespData> ps = new YourResponsePreparedStatement();
    databaseItemWriter.setItemPreparedStatementSetter(ps);
    return databaseItemWriter;
}

public class YourResponsePreparedStatement implements ItemPreparedStatementSetter<RespData> {
    @Override
    public void setValues(RespData respData, PreparedStatement preparedStatement) throws SQLException {
        preparedStatement.setString(1, respData.getYourData());
    }
}
The response writer logs responses to a table, to keep tabs on the processed data for analytics or business reporting.
@Bean
public ItemProcessor<YourDataType,RespData> processor() {
return new YOURProcessor();
}
The processor is where the core logic for the data manipulation is written. The response returned is of the type expected by the data writer.
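A hypothetical implementation consistent with the types above (the HTTP call and the field mapping are placeholders, and setYourData is an assumed setter, not the author's actual logic):
import org.springframework.batch.item.ItemProcessor;

public class YourProcessor implements ItemProcessor<YourDataType, RespData> {
    @Override
    public RespData process(YourDataType item) throws Exception {
        // make your http calls / apply your business rules on the item here
        RespData response = new RespData();
        response.setYourData("processed: " + item); // map your fields here
        return response;
    }
}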
@Configuration
@EnableAutoConfiguration
@EnableBatchProcessing
public class BatchConfiguration extends DefaultBatchConfigurer {
@Override
public void setDataSource(DataSource dataSource) {}
}
Overriding setDataSource with an empty body keeps Spring Batch from writing its metadata tables through the application's Oracle datasource (it falls back to an in-memory job repository); otherwise an exception such as this could be encountered:
at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: org.springframework.dao.CannotSerializeTransactionException: PreparedStatementCallback; SQL [INSERT into BATCH_JOB_INSTANCE(JOB_INSTANCE_ID, JOB_NAME, JOB_KEY, VERSION) values (?, ?, ?, ?)]; ORA-08177: can't serialize access for this transaction ; nested exception is java.sql.SQLException: ORA-08177: can't serialize access for this transaction
A ColumnRangePartitioner can be created as:
@Component
public class ColumnRangePartitioner implements Partitioner {
    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> result = new HashMap<>();
        int start = 0;
        while (start < gridSize) {
            ExecutionContext value = new ExecutionContext();
            value.putInt("modulo", start);
            result.put("partition" + start, value);
            start += 1;
        }
        return result;
    }
}
Our job focuses on executing step1, which spawns threads based on the partitioner provided (here the ColumnRangePartitioner) to process the step.
Grid size is the number of parallel threads (and the divisor used for the modulo calculation).
Every processStep execution is a series of reading the data for that thread's assigned modulo, processing it, and writing it.
@Bean
public ColumnRangePartitioner getPartitioner() throws Exception {
ColumnRangePartitioner columnRangePartitioner = new ColumnRangePartitioner();
return columnRangePartitioner;
}
@Bean
public Step step1(@Qualifier("processStep") Step processStep,
StepBuilderFactory stepBuilderFactory) throws Exception {
return stepBuilderFactory.get("step1")
.listener(jobCompletionNotifier)
.partitioner(processStep.getName(), getPartitioner())
.step(processStep)
.gridSize(parallelThreads)
.taskExecutor(taskExecutor())
.build();
}
@Bean
public Step processStep(
@Qualifier("DataReader") ItemReader<ReadType> reader,
@Qualifier("LogWRITE") ItemWriter<WriterType> writer,
StepBuilderFactory stepBuilderFactory) throws Exception {
return stepBuilderFactory.get("processStep")
.<ReadType,WriterType> chunk(1)
.reader(reader)
.processor(processor())
.writer(writer)
.faultTolerant()
.skip(Exception.class)
.skipLimit(exceptionLimit)
.build();
}
@Bean
public SimpleAsyncTaskExecutor taskExecutor() {
SimpleAsyncTaskExecutor asyncTaskExecutor = new SimpleAsyncTaskExecutor();
return asyncTaskExecutor;
}
@Bean
public Job ourJob(@Qualifier("step1") Step step1, JobBuilderFactory jobBuilderFactory) throws Exception {
return jobBuilderFactory.get("ourjob")
.start(step1)
.incrementer(new RunIdIncrementer())
.preventRestart()
.build();
}
This might be a usual Spring Batch solution, but it is applicable to every migration requirement involving commonly used SQL databases and Java-based processing.
Avoid executing the join query repeatedly and then filtering: complex joins can impact database performance. A better solution is to fetch the data once and split it internally. The memory used by the application will be large, since the map is populated with all the data your query fetches, but Java is capable of handling that. The fetched data can then be passed to a ListItemReader so each thread processes its own list in parallel, as in the sketch below.
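A minimal sketch of that idea (the holder and type names are assumptions, not the author's code): one up-front query fills a map keyed by modulo, and each partition gets a step-scoped ListItemReader over its own slice.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.support.ListItemReader;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.stereotype.Component;

@Component
public class PrefetchedData {
    private final Map<Long, List<YourDataType>> slices = new ConcurrentHashMap<>();

    // Run once (e.g. in a tasklet step before the partitioned step): execute the
    // big join query a single time and split the result by modulo of the key
    public void load(List<YourDataType> allRows, int gridSize) {
        for (YourDataType row : allRows) {
            long modulo = Long.parseLong(row.getYourKey()) % gridSize;
            slices.computeIfAbsent(modulo, k -> new ArrayList<>()).add(row);
        }
    }

    public List<YourDataType> sliceFor(long modulo) {
        return slices.getOrDefault(modulo, Collections.emptyList());
    }
}

// in your batch @Configuration class: each thread reads only its own
// pre-fetched slice, with no second trip to the database
@Bean
@StepScope
public ListItemReader<YourDataType> prefetchedReader(
        PrefetchedData data,
        @Value("#{stepExecutionContext['modulo']}") Long modulo) {
    return new ListItemReader<>(data.sliceFor(modulo));
}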
For parallel requests (not threads, but parallel API calls hitting this application), the design can be modified to process a given query only once, keeping a lock on it so that other callers wait; once the lock is released, the waiting callers find the data already present and the DB is not queried again.
The full code for that implementation is too complex for this scope, but a heavily simplified sketch follows. Feel free to ask if your application requires a particular use case.
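This sketch uses ConcurrentHashMap.computeIfAbsent in place of an explicit Semaphore; it gives the same execute-once-per-key guarantee (the class and method names are assumptions):
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class OncePerKeyQueryCache<T> {
    private final ConcurrentHashMap<String, List<T>> cache = new ConcurrentHashMap<>();

    // The first caller for a key runs the expensive query; concurrent callers
    // for the same key block until it finishes, then all reuse the cached result
    public List<T> getOrLoad(String queryKey, Function<String, List<T>> expensiveQuery) {
        return cache.computeIfAbsent(queryKey, expensiveQuery);
    }
}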
Would love to solve any issues regarding the same. Feel free to reach out to me (Akshay) at [email protected] or my colleague (Sagar) at [email protected]
Upvotes: 0
Reputation: 1082
I've had a similar task to yours; unluckily we were using Java 1.7 and Spring 3.x. I can provide the configuration in XML, so you may be able to adapt it to annotation configuration; I haven't tried that myself.
<batch:job id="dualAgeRestrictionJob">
<!-- use a listener if you need one -->
<batch:listeners>
<batch:listener ref="dualAgeRestrictionJobListener" />
</batch:listeners>
<!-- master step, 10 threads (grid-size) -->
<batch:step id="dualMasterStep">
<batch:partition step="dualSlaveStep"
partitioner="arInputRangePartitioner">
<batch:handler grid-size="${AR_GRID_SIZE}" task-executor="taskExecutor" />
</batch:partition>
</batch:step>
</batch:job>
<!-- here you define your reader, processor, writer and the commit interval -->
<batch:step id="dualSlaveStep">
<batch:tasklet transaction-manager="transactionManager">
<batch:chunk reader="arInputPagingItemReader"
writer="arOutputWriter" processor="arInputItemProcessor"
commit-interval="${AR_COMMIT_INTERVAL}" />
</batch:tasklet>
</batch:step>
<!-- The partitioner -->
<bean id="arInputRangePartitioner" class="com.example.ArInputRangePartitioner">
<property name="arInputDao" ref="arInputJDBCTemplate" />
<property name="statsForMail" ref="statsForMail" />
</bean>
<bean id="taskExecutor"
class="org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor">
<property name="corePoolSize" value="${AR_CORE_POOL_SIZE}" />
<property name="maxPoolSize" value="${AR_MAX_POOL_SIZE}" />
<property name="allowCoreThreadTimeOut" value="${AR_ALLOW_CORE_THREAD_TIME_OUT}" />
</bean>
<bean id="transactionManager"
class="org.springframework.jdbc.datasource.DataSourceTransactionManager">
<property name="dataSource" ref="kvrDatasource" />
</bean>
The partitioner makes a query to count the rows and make chunks for each thread:
public class ArInputRangePartitioner implements Partitioner {
private static final Logger logger = LoggerFactory.getLogger(ArInputRangePartitioner.class);
private ArInputDao arInputDao;
private StatsForMail statsForMail;
@Override
public Map<String, ExecutionContext> partition(int gridSize) {
Map<String, ExecutionContext> result = new HashMap<String, ExecutionContext>();
// You can make a query and then divide the from/to range for each thread
Map<Integer,Integer> idMap = arInputDao.getOrderIdList();
Integer countRow = idMap.size();
statsForMail.setNumberOfRecords( countRow );
Integer range = countRow / gridSize;
Integer remains = countRow % gridSize;
int fromId = 1;
int toId = range;
for (int i = 1; i <= gridSize; i++) {
ExecutionContext value = new ExecutionContext();
if(i == gridSize) {
toId += remains;
}
logger.info("\nStarting : Thread {}", i);
logger.info("fromId : {}", idMap.get(fromId) );
logger.info("toId : {}", idMap.get(toId) );
value.putInt("fromId", idMap.get(fromId) );
value.putInt("toId", idMap.get(toId) );
value.putString("name", "Thread" + i);
result.put("partition" + i, value);
fromId = toId + 1;
toId += range;
}
return result;
}
public ArInputDao getArInputDao() {
return arInputDao;
}
public void setArInputDao(ArInputDao arInputDao) {
this.arInputDao = arInputDao;
}
public StatsForMail getStatsForMail() {
return statsForMail;
}
public void setStatsForMail(StatsForMail statsForMail) {
this.statsForMail = statsForMail;
}
}
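The getOrderIdList() call above isn't shown in the original; a minimal JdbcTemplate-based version could look like this (table and column names are assumptions), mapping a 1-based row index to the actual order id so the partitioner can translate row ranges into id ranges:
import java.util.LinkedHashMap;
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class ArInputJDBCTemplate implements ArInputDao {
    private JdbcTemplate jdbcTemplate;

    public void setDataSource(DataSource dataSource) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
    }

    @Override
    public Map<Integer, Integer> getOrderIdList() {
        Map<Integer, Integer> idMap = new LinkedHashMap<>();
        int index = 1;
        // Ordered ids, so contiguous index ranges map to contiguous id ranges
        for (Integer id : jdbcTemplate.queryForList(
                "SELECT order_id FROM ar_input ORDER BY order_id", Integer.class)) {
            idMap.put(index++, id);
        }
        return idMap;
    }
}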
This is the configuration for the reader and writer:
<bean id="arInputPagingItemReader" class="org.springframework.batch.item.database.JdbcPagingItemReader" scope="step" >
<property name="dataSource" ref="kvrDatasource" />
<property name="queryProvider">
<bean class="org.springframework.batch.item.database.support.SqlPagingQueryProviderFactoryBean" >
<property name="dataSource" ref="kvrDatasource" />
<property name="selectClause" value="${AR_INPUT_PAGING_ITEM_READER_SELECT}" />
<property name="fromClause" value="${AR_INPUT_PAGING_ITEM_READER_FROM}" /> <property name="whereClause" value="${AR_INPUT_PAGING_ITEM_READER_WHERE}" />
<property name="sortKey" value="${AR_INPUT_PAGING_ITEM_READER_SORT}" />
</bean>
</property>
<!-- Inject via the ExecutionContext in rangePartitioner -->
<property name="parameterValues">
<map>
<entry key="fromId" value="#{stepExecutionContext[fromId]}" />
<entry key="toId" value="#{stepExecutionContext[toId]}" />
</map>
</property>
<property name="pageSize" value="${AR_PAGE_SIZE}" />
<property name="rowMapper" ref="arOutInRowMapper" />
</bean>
<bean id="arOutputWriter"
class="org.springframework.batch.item.database.JdbcBatchItemWriter"
scope="step">
<property name="dataSource" ref="kvrDatasource" />
<property name="sql" value="${SQL_AR_OUTPUT_INSERT}"/>
<property name="itemSqlParameterSourceProvider">
<bean class="org.springframework.batch.item.database.BeanPropertyItemSqlParameterSourceProvider" />
</property>
</bean>
Maybe someone knows how to convert this to modern Spring Batch / Spring Boot.
PS: Don't use too many threads, otherwise Spring Batch will lose a lot of time filling its own metadata tables. You have to benchmark to find the correct configuration.
I also suggest not using JPA/Hibernate with millions of rows; in my case I used JdbcTemplate.
EDIT: for annotation configuration see this question.
Here follows an example of a configuration with a partitioner:
@Configuration
@RequiredArgsConstructor
public class JobConfig {
private static final Logger log = LoggerFactory.getLogger(JobConfig.class);
private final JobBuilderFactory jobBuilderFactory;
private final StepBuilderFactory stepBuilderFactory;
@Value(value = "classpath:employees.csv")
private Resource resource;
@Bean("MyJob1")
public Job createJob(@Qualifier("MyStep1") Step stepMaster) {
return jobBuilderFactory.get("MyJob1")
.incrementer(new RunIdIncrementer())
.start(stepMaster)
.build();
}
@Bean("MyStep1")
public Step step(PartitionHandler partitionHandler, Partitioner partitioner) {
return stepBuilderFactory.get("MyStep1")
.partitioner("slaveStep", partitioner)
.partitionHandler(partitionHandler)
.build();
}
@Bean("slaveStep")
public Step slaveStep(FlatFileItemReader<Employee> reader) {
return stepBuilderFactory.get("slaveStep")
.<Employee, Employee>chunk(1)
.reader(reader)
.processor((ItemProcessor<Employee, Employee>) employee -> {
System.out.printf("Processed item %s%n", employee.getId());
return employee;
})
.writer(list -> {
for (Employee item : list) {
System.out.println(item);
}
})
.build();
}
@Bean
public Partitioner partitioner() {
return gridSize -> {
Map<String, ExecutionContext> result = new HashMap<>();
int lines = 0;
try(BufferedReader reader = new BufferedReader(new InputStreamReader(resource.getInputStream()))) {
while (reader.readLine() != null) lines++;
} catch (IOException e) {
throw new RuntimeException(e);
}
int range = lines / gridSize;
int remains = lines % gridSize;
int fromLine = 0;
int toLine = range;
for (int i = 1; i <= gridSize; i++) {
if(i == gridSize) {
toLine += remains;
}
ExecutionContext value = new ExecutionContext();
value.putInt("fromLine", fromLine);
value.putInt("toLine", toLine);
fromLine = toLine;
toLine += range;
result.put("partition" + i, value);
}
return result;
};
}
@StepScope
@Bean
public FlatFileItemReader<Employee> flatFileItemReader(@Value("#{stepExecutionContext['fromLine']}") int startLine, @Value("#{stepExecutionContext['toLine']}") int lastLine) {
FlatFileItemReader<Employee> reader = new FlatFileItemReader<>();
reader.setResource(resource);
DefaultLineMapper<Employee> lineMapper = new DefaultLineMapper<>();
lineMapper.setFieldSetMapper(fieldSet -> {
String[] values = fieldSet.getValues();
return Employee.builder()
.id(Integer.parseInt(values[0]))
.firstName(values[1])
.build();
});
lineMapper.setLineTokenizer(new DelimitedLineTokenizer(";"));
reader.setLineMapper(lineMapper);
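// currentItemCount skips everything before this partition's first line;
// maxItemCount makes read() return null once the partition's last line is consumed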
reader.setCurrentItemCount(startLine);
reader.setMaxItemCount(lastLine);
return reader;
}
@Bean
public PartitionHandler partitionHandler(@Qualifier("slaveStep") Step step, TaskExecutor taskExecutor) {
TaskExecutorPartitionHandler taskExecutorPartitionHandler = new TaskExecutorPartitionHandler();
taskExecutorPartitionHandler.setTaskExecutor(taskExecutor);
taskExecutorPartitionHandler.setStep(step);
taskExecutorPartitionHandler.setGridSize(5);
return taskExecutorPartitionHandler;
}
@Bean
public TaskExecutor taskExecutor() {
ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
taskExecutor.setMaxPoolSize(5);
taskExecutor.setCorePoolSize(5);
taskExecutor.setQueueCapacity(5);
taskExecutor.afterPropertiesSet();
return taskExecutor;
}
}
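The Employee type is not shown above; a minimal Lombok-based version consistent with the builder calls in the reader could be:
import lombok.Builder;
import lombok.Data;

@Data
@Builder
public class Employee {
    private int id;
    private String firstName;
}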
Upvotes: 1