Is there a best practice for querying huge CSV files in a Spring Boot context?

Question

I'm working at a well-known company on a project that must integrate with another system that produces one 27 GB CSV file per hour. The goal is to query these files without importing them (the main problem is bureaucracy: nobody wants to take responsibility if some data changes).

The main filter on these files is by date: the end user must enter a start and end date. After that, the results can be filtered by a few strings.

Context: Spring Boot microservices
Server: Xeon processor, 24 cores, 256 GB RAM
Filesystem: NFS mounted from an external server
Test data: 1000 files, 1 GB each

To improve performance, I'm indexing the files by date: each file name contains the date range it covers, and the files are stored in a folder structure like yyyy/mm/dd. For each of the following tests, the first step was to build a raw list of the file paths to read.
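The path-listing step could look roughly like the sketch below. It is only an illustration under stated assumptions: the class name `DatePartitionedFiles`, the `.csv` extension, and the idea that one directory holds exactly one day's files are hypothetical, not taken from the real project.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper: collects CSV files from a root/yyyy/mm/dd layout.
final class DatePartitionedFiles {

    /** Directory for one day under the root, e.g. root/2023/04/09. */
    static Path dayDirectory(Path root, LocalDate day) {
        return root.resolve(String.format("%04d", day.getYear()))
                   .resolve(String.format("%02d", day.getMonthValue()))
                   .resolve(String.format("%02d", day.getDayOfMonth()));
    }

    /** Lists all *.csv files for every day in [start, end], both inclusive. */
    static List<Path> listCsvFiles(Path root, LocalDate start, LocalDate end)
            throws IOException {
        List<Path> result = new ArrayList<>();
        for (LocalDate day = start; !day.isAfter(end); day = day.plusDays(1)) {
            Path dir = dayDirectory(root, day);
            if (!Files.isDirectory(dir)) {
                continue; // no data was delivered for this day
            }
            // Files.list must be closed, hence try-with-resources.
            try (Stream<Path> entries = Files.list(dir)) {
                entries.filter(p -> p.getFileName().toString().endsWith(".csv"))
                       .sorted()
                       .forEach(result::add);
            }
        }
        return result;
    }
}
```

Because only the directories inside the requested range are listed, the cost of this step grows with the number of days queried rather than with the total number of files on the NFS mount.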

Test results (each search reads all files):

Spring Batch - buffered reader, parsing into objects: 12,097 sec
Plain Java - thread pool, buffered reader, parsing into objects: 10,882 sec
Linux egrep with a regex, run in parallel from Java, parsing into objects: 7,701 sec
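For reference, the plain Java variant is along the lines of the sketch below. This is a simplified illustration, not the benchmarked code: the class name `ParallelCsvScan` is hypothetical, the filter is an arbitrary `Predicate<String>` on the raw line, and matching lines are kept as strings instead of being parsed into objects.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

// Hypothetical sketch: one task per file on a fixed-size thread pool.
final class ParallelCsvScan {

    /** Scans all files in parallel; results keep the order of the input list. */
    static List<String> scan(List<Path> files, Predicate<String> filter, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (Path file : files) {
                futures.add(pool.submit(() -> scanFile(file, filter)));
            }
            List<String> matches = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                matches.addAll(future.get()); // blocks until that file is done
            }
            return matches;
        } finally {
            pool.shutdown();
        }
    }

    /** Reads one file line by line and keeps the lines accepted by the filter. */
    static List<String> scanFile(Path file, Predicate<String> filter) throws IOException {
        List<String> matches = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (filter.test(line)) {
                    matches.add(line);
                }
            }
        }
        return matches;
    }
}
```

With files on NFS the pool size probably matters more for I/O concurrency than for CPU use, so the `threads` value is something to measure rather than derive from the core count.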

The dirtiest approach is also the fastest. I want to avoid it because the security department warned me about all the checks I'd have to make on input data to prevent shell injection.
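On the shell injection point: as far as I understand, the risk comes from building a command string that a shell interprets. `ProcessBuilder` given an argument list starts the program directly without a shell, so each argument reaches grep as-is. The sketch below shows that idea; the class name `SafeGrep` is hypothetical, and it assumes the user's filter is a literal string (`grep -F`) rather than a regex.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: run grep without going through a shell.
final class SafeGrep {

    /** Builds the argument list; the search term is a single argv entry. */
    static List<String> buildCommand(String literalTerm, List<Path> files) {
        List<String> cmd = new ArrayList<>();
        cmd.add("grep");
        cmd.add("-F"); // treat the term as a fixed string, not a regex
        cmd.add("-h"); // do not prefix output lines with file names
        cmd.add("-e"); // next argument is the pattern, even if it starts with '-'
        cmd.add(literalTerm);
        cmd.add("--"); // end of options; everything after is a file name
        for (Path file : files) {
            cmd.add(file.toString());
        }
        return cmd;
    }

    /** Runs grep and returns the matching lines (collected in memory). */
    static List<String> run(String literalTerm, List<Path> files)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(literalTerm, files))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // Java 9+
                .start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        int exit = process.waitFor();
        if (exit > 1) { // grep: 0 = match found, 1 = no match, >1 = error
            throw new IOException("grep failed with exit code " + exit);
        }
        return lines;
    }
}
```

This does not remove the need to validate input (for example, restricting which file paths may be passed), and for files of this size the output would have to be streamed to a consumer instead of collected in a list.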

While googling I found the MariaDB CONNECT engine, which can also point at huge CSV files, so I'm now going down that route, creating a temporary table over the files a search is interested in. The downside is that I have to create one table per query, since the dates can differ.

For the first year we expect no more than 5 parallel searches at the same time, with an average range of 3 weeks. These queries will be run asynchronously.

Do you know of anything that could help me with this? I'm interested not only in speed but also in good practice to apply. Thanks a lot, folks.

